[python-uk] Favourite ways of scrubbing HTML/whitelisting specific HTML tags?

Thu Feb 7 19:44:30 CET 2008

On Thu, Feb 07, 2008 at 05:50:37PM +0000, Michael Sparks wrote:
> > The code at
> > http://www.voidspace.org.uk/python/weblog/arch_d7_2005_04_23.shtml#e35
> > is wrong, for example.
> 
> That's because it whitelists a collection of tags but doesn't whitelist 
> specific attributes, I presume.

That's certainly a big problem, yes. There are other issues, but more
importantly, from my point of view, it works in completely the
wrong way ;-) It uses a lax HTML parser to try to work out what's
going on with the input, and then strips any 'bad data' that it
recognises. This will fall apart if the HTML is mangled in such a way
that the 'tag stripper' parser doesn't understand it, but a web
browser will. Given all the different versions of all the different
browsers out there, this approach is doomed to failure.  

The correct way to do it would be to strip everything *except* that
which is 100% recognised to be allowable, i.e. never allow a '<'
or '&' character through (or any other character, for that matter)
unless we know precisely what its effect is and that it complies with
the HTML spec.
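
As a rough illustration of that idea, here is a minimal sketch, not a
vetted sanitiser. It assumes modern Python 3 (html.escape is not in
the Python of this thread's era), and the ALLOWED_TAGS list and the
scrub() name are made up for the example. It escapes every '<', '>',
'&' and quote first, then restores only tags that match the whitelist
exactly, with no attributes at all:

```python
import html
import re

# Hypothetical whitelist for illustration: bare tags only, no attributes.
ALLOWED_TAGS = ("b", "i", "em", "strong", "p")

# Matches only the *escaped* form of an exact, attribute-free tag,
# e.g. "&lt;b&gt;" or "&lt;/em&gt;". Anything else stays escaped.
_ALLOWED_RE = re.compile(r"&lt;(/?)(%s)&gt;" % "|".join(ALLOWED_TAGS))


def scrub(text):
    """Escape everything, then re-enable only exactly recognised tags."""
    # Step 1: neutralise every '<', '>', '&' and quote character.
    escaped = html.escape(text, quote=True)
    # Step 2: turn back only the escaped tags we recognise precisely.
    return _ALLOWED_RE.sub(r"<\1\2>", escaped)


if __name__ == "__main__":
    print(scrub('<b>bold</b> <script>alert(1)</script> <b onclick="x">'))
```

Because the default is "escape", markup the pattern does not recognise
(attributes, odd whitespace, upper-case tags, mangled tags) comes out
as inert text rather than slipping through. The sketch does not check
that the allowed tags are balanced or properly nested, so it falls
short of guaranteeing spec-compliant output.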