[python-uk] Favourite ways of scrubbing HTML/whitelisting specific HTML tags?

Thu Feb 7 20:11:41 CET 2008

On 07/02/2008, Jon Ribbens <jon+python-uk at unequivocal.co.uk> wrote:
> On Thu, Feb 07, 2008 at 05:50:37PM +0000, Michael Sparks wrote:
> > > The code at
> > > http://www.voidspace.org.uk/python/weblog/arch_d7_2005_04_23.shtml#e35
> > > is wrong, for example.
> >
> > That's because it whitelists a collection of tags but doesn't whitelist
> > specific attributes, I presume.
>
> That's certainly a big problem, yes. There are other issues, but more
> importantly from my point of view, is that it works in completely the
> wrong way ;-) It uses a lax HTML parser to try and work out what's
> going on with the input, and then strips any 'bad data' that it
> recognises. This will fall apart if the HTML is mangled in such a way
> that the 'tag stripper' parser doesn't understand it, but a web
> browser will. Given all the different versions of all the different
> browsers out there, this approach is doomed to failure.
>
> The correct way to do it would be to strip everything *except* that
> which is 100% recognised to be allowable. i.e. never allow a '<'
> or '&' character through (or any other character, for that matter)
> unless we know precisely what its effect is and that it complies with
> the HTML spec.

Hi,
I have used Beautiful Soup for parsing HTML.
It works very nicely and I didn't see much of an issue with speed when
parsing several hundred HTML files every hour or so.
I also rolled my own using various regexes and stuff nicked from a
Perl library. It was awful and feature-incomplete. Beautiful Soup worked
better.
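
For illustration, the whitelist-first approach Jon describes above could
be sketched roughly as follows. This is only a minimal sketch, not the
code from either message: it assumes Python 3 and uses the standard
library's html.parser rather than Beautiful Soup, and the names ALLOWED,
SAFE_URL_PREFIXES, WhitelistSanitiser and sanitise are all made up for
the example. The idea is that nothing from the input is copied through
verbatim: the output is rebuilt from scratch, so every '<' and '&' in it
is either generated by the sanitiser or escaped.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist: tag name -> attribute names allowed on that tag.
ALLOWED = {
    "b": set(),
    "i": set(),
    "em": set(),
    "strong": set(),
    "p": set(),
    "a": {"href"},
}

# Hypothetical list of URL schemes considered safe in href values.
SAFE_URL_PREFIXES = ("http://", "https://", "mailto:")


class WhitelistSanitiser(HTMLParser):
    """Re-emit only whitelisted tags and attributes; escape all text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED:
            return  # unknown tag: drop the tag itself
        kept = []
        for name, value in attrs:
            if name not in ALLOWED[tag] or value is None:
                continue  # unknown or valueless attribute: drop it
            if name == "href":
                url = value.strip().lower()
                if not url.startswith(SAFE_URL_PREFIXES):
                    continue  # e.g. javascript: URLs are dropped
            kept.append(' %s="%s"' % (name, escape(value, quote=True)))
        self.out.append("<%s%s>" % (tag, "".join(kept)))

    def handle_endtag(self, tag):
        if tag in ALLOWED:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        # Text is always escaped, so stray '<' and '&' never pass through.
        self.out.append(escape(data, quote=False))


def sanitise(text):
    parser = WhitelistSanitiser()
    parser.feed(text)
    parser.close()
    return "".join(parser.out)
```

Something like sanitise('<b onclick="x()">hi</b>') should then give back
'<b>hi</b>'. Note that the sketch does not balance unclosed tags, and
the contents of dropped elements (a script body, say) come out as
escaped plain text rather than disappearing, so a real scrubber would
need more work than this.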

Shaun Laughey.