[python-uk] Favourite ways of scrubbing HTML/whitelisting specific HTML tags?
Michael Sparks
ms at cerenity.org
Thu Feb 7 18:50:37 CET 2008
On Thursday 07 February 2008 15:48:46 Jon Ribbens wrote:
> Be aware that if you are doing this for security reasons (e.g. to
> prevent cross-site scripting),
It is for that reason, essentially.
> it is very hard to get right.
Indeed, that's why I thought I'd find out what everyone else actually uses
rather than follow one of the various approaches I could take.
> The code at
> http://www.voidspace.org.uk/python/weblog/arch_d7_2005_04_23.shtml#e35
> is wrong, for example.
That's because it whitelists a collection of tags but doesn't whitelist
specific attributes, I presume.
I can certainly adapt that code to work the way I'd prefer it.
Changing allowed_tags to something like:
    allowed_tags = {
        'a': ["id", "name", "href"],
        'img': ["id", "src"],
        # ...
        # <tag>: [<list of allowed attributes>],
    }
Would allow that code to be used with only a small modification, if I'm
reading your objection right.
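
A minimal sketch of that per-tag attribute whitelist, for illustration only. It assumes Python's standard html.parser module rather than the htmldata-based code linked above, and the names ALLOWED_TAGS, VOID_TAGS, WhitelistScrubber and scrub are hypothetical. It does not inspect attribute values, so something like a "javascript:" URL in an allowed href would still pass through; it shows the data structure in use and is not a complete XSS defence.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist in the shape proposed above: tag -> allowed attributes.
ALLOWED_TAGS = {
    "a": ["id", "name", "href"],
    "img": ["id", "src"],
}

# Tags that never take a closing tag (assumed subset, for illustration).
VOID_TAGS = {"img", "br", "hr"}


class WhitelistScrubber(HTMLParser):
    """Re-emit only whitelisted tags and attributes; escape all text."""

    def __init__(self, allowed_tags):
        super().__init__(convert_charrefs=True)
        self.allowed_tags = allowed_tags
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.allowed_tags:
            return  # drop the tag itself; any text inside is still emitted, escaped
        kept = [
            ' %s="%s"' % (name, escape(value or "", quote=True))
            for name, value in attrs
            if name in self.allowed_tags[tag]
        ]
        self.out.append("<%s%s>" % (tag, "".join(kept)))

    def handle_endtag(self, tag):
        if tag in self.allowed_tags and tag not in VOID_TAGS:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        # Text is always re-escaped, so stray markup cannot leak through as tags.
        self.out.append(escape(data, quote=False))


def scrub(html_text, allowed_tags=ALLOWED_TAGS):
    parser = WhitelistScrubber(allowed_tags)
    parser.feed(html_text)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.out)


if __name__ == "__main__":
    print(scrub('<a href="http://example.com/" onclick="evil()">link</a>'))
```

Disallowed tags are dropped while their text is kept, and disallowed attributes on allowed tags are removed. Whether to drop or escape unknown tags is a design choice; this sketch drops them.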
On Thursday 07 February 2008 15:20:17 Michael Foord wrote:
...
> I used htmldata a while ago to do this:
>
> http://www.voidspace.org.uk/python/weblog/arch_d7_2005_04_23.shtml#e35
Much appreciated - I may well start from that approach.
On Thursday 07 February 2008 15:30:57 Alexander Harrowell wrote:
> If you're not bothered about speed, BeautifulSoup can catch, remove and
> replace arbitrary HTML tags in a document.
Initially, speed isn't an issue. OK, so one vote in favour of BeautifulSoup, one
in favour of htmldata & one pointing out a problem with one specific
example...
Michael.