[python-uk] Favourite ways of scrubbing HTML/whitelisting specific HTML tags?

Thu Feb 7 18:50:37 CET 2008

On Thursday 07 February 2008 15:48:46 Jon Ribbens wrote:
> Be aware that if you are doing this for security reasons (e.g. to
> prevent cross-site scripting),

It is for that reason, essentially.

> it is very hard to get right.

Indeed, that's why I thought I'd find out what everyone else actually uses
rather than just pick one of the various approaches I could take.

> The code at
> http://www.voidspace.org.uk/python/weblog/arch_d7_2005_04_23.shtml#e35
> is wrong, for example.

That's because it whitelists a collection of tags but doesn't whitelist 
specific attributes, I presume.

I can certainly adapt that code to work the way I'd prefer it.

Changing allowed_tags to something like:

allowed_tags = {
    'a': ['id', 'name', 'href'],
    'img': ['id', 'src'],
    # ..
    # <tag>: [<list of allowed attributes>],
}

would allow that code to be used with only a small modification, if I'm
reading your objection right.
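
A minimal sketch of that idea, assuming the standard library's html.parser
in place of htmldata (so this is a swapped-in technique, not the code from
the linked weblog entry). The names WhitelistScrubber, scrub and VOID_TAGS
are hypothetical, and the 'p' and 'b' entries are illustrative additions to
the whitelist:

```python
from html import escape
from html.parser import HTMLParser

# Whitelist in the shape proposed above: tag -> list of allowed attributes.
# The 'p' and 'b' entries are illustrative assumptions.
allowed_tags = {
    'a': ['id', 'name', 'href'],
    'img': ['id', 'src'],
    'p': [],
    'b': [],
}

# Tags that never take a closing tag (assumed subset, for illustration).
VOID_TAGS = {'img', 'br', 'hr'}


class WhitelistScrubber(HTMLParser):
    """Re-emit only whitelisted tags and attributes; escape all text."""

    def __init__(self, allowed):
        super().__init__(convert_charrefs=True)
        self.allowed = allowed
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.allowed:
            return  # drop the tag itself; its text content is still kept
        # Keep only attributes whose names are whitelisted for this tag.
        kept = [(k, v) for k, v in attrs if k in self.allowed[tag]]
        rendered = ''.join(
            ' %s="%s"' % (k, escape(v or '', quote=True)) for k, v in kept
        )
        self.out.append('<%s%s>' % (tag, rendered))

    def handle_endtag(self, tag):
        if tag in self.allowed and tag not in VOID_TAGS:
            self.out.append('</%s>' % tag)

    def handle_data(self, data):
        # Text is always re-escaped so it cannot introduce new markup.
        self.out.append(escape(data, quote=False))


def scrub(html_text, allowed=allowed_tags):
    parser = WhitelistScrubber(allowed)
    parser.feed(html_text)
    parser.close()  # flush any buffered trailing text
    return ''.join(parser.out)
```

Note that this only filters attribute names. An href or src value such as a
javascript: URL would still pass through, so the values would need checking
as well before this could be relied on against cross-site scripting.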

On Thursday 07 February 2008 15:20:17 Michael Foord wrote:
...
> I used htmldata a while ago to do this:
>
> http://www.voidspace.org.uk/python/weblog/arch_d7_2005_04_23.shtml#e35

Much appreciated - I may well start from that approach.

On Thursday 07 February 2008 15:30:57 Alexander Harrowell wrote:
> If you're not bothered about speed, BeautifulSoup can catch, remove and
> replace arbitrary HTML tags in a document.

Initially, speed isn't an issue. OK, so that's one vote in favour of
BeautifulSoup, one in favour of htmldata, and one pointing out a problem with
one specific example...

Michael.