[Tutor] Re: [newbie] sanitizing HTML
Andrei
project5 at redrival.net
Fri Nov 14 14:44:21 EST 2003
Barnaby Scott wrote on Fri, 14 Nov 2003 10:52:02 -0800 (PST):
> I am trying to write a script which will take some
> HTML and go through it stripping out all tags that are
> not expressly permitted in my script. The ones that I
> will permit will generally be the basic harmless ones
> like <p>, <br>, <hr>, <h1...>, <b> etc.
You can subclass sgmllib.SGMLParser. By providing unknown_starttag,
unknown_endtag, handle_entityref and handle_data implementations, you can
"trap" every tag and analyze/modify/delete it.
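
A minimal sketch of the idea, under one loud assumption: sgmllib only exists
in Python 2, so this uses html.parser.HTMLParser from the Python 3 standard
library instead. There the hooks are called handle_starttag and handle_endtag
rather than unknown_starttag and unknown_endtag. The ALLOWED_TAGS set and the
names TagStripper and strip_tags are made up for illustration.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist, for illustration only.
ALLOWED_TAGS = {"p", "br", "hr", "h1", "h2", "h3", "b", "i"}


class TagStripper(HTMLParser):
    """Rebuild the input, keeping only whitelisted tags (attributes dropped)."""

    def __init__(self):
        # convert_charrefs=False so that entities are reported through
        # handle_entityref/handle_charref instead of being folded into data.
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        # Every start tag lands here; emit it only if it is permitted.
        if tag in ALLOWED_TAGS:
            self.out.append("<%s>" % tag)

    def handle_endtag(self, tag):
        if tag in ALLOWED_TAGS:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        # Plain text between tags; re-escape stray <, > and &.
        self.out.append(escape(data, quote=False))

    def handle_entityref(self, name):
        # Named entities such as &amp; are passed through unchanged.
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        # Numeric references such as &#169; are passed through unchanged.
        self.out.append("&#%s;" % name)


def strip_tags(html):
    parser = TagStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


print(strip_tags('<p>Hi <div class="x"><b>there</b></div> &amp; you</p>'))
```

Note that the text inside a removed tag is kept; only the tag itself is
dropped. If the contents of some tags should vanish too, the parser would
need to track that state itself.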
> I also want to allow some of the more complex ones
> (e.g. <a>, <img>, <body>, <table>) but limit their
> attributes to a permitted subset.
<snip>
Of the methods I mentioned above, unknown_starttag receives (tag, attrs),
with attrs being a list of (name, value) pairs; unknown_endtag receives only
the tag name. You can do whatever you like with these attributes when you
rebuild the data.
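
A rough sketch of filtering attributes while rebuilding the data. As an
assumption it uses html.parser.HTMLParser (Python 3) rather than sgmllib,
which is Python 2 only; its handle_starttag also receives (tag, attrs) with
attrs as a list of (name, value) pairs. The ALLOWED policy and the names
AttrFilter and sanitize are hypothetical.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical policy: tag name -> attribute names permitted on that tag.
ALLOWED = {
    "p": set(),
    "b": set(),
    "a": {"href", "title"},
    "img": {"src", "alt"},
}


class AttrFilter(HTMLParser):
    """Keep whitelisted tags and, on those, only whitelisted attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED:
            return  # drop the whole tag
        parts = [tag]
        for name, value in attrs:
            if name in ALLOWED[tag]:
                # value is None for bare attributes; the parser has already
                # unescaped it, so escape it again for output.
                parts.append('%s="%s"' % (name, escape(value or "", quote=True)))
        self.out.append("<%s>" % " ".join(parts))

    def handle_endtag(self, tag):
        if tag in ALLOWED:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        # Entities arrive here already decoded, so escape on the way out.
        self.out.append(escape(data, quote=False))


def sanitize(html):
    parser = AttrFilter()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


print(sanitize('<a href="http://pears.sf.net" onclick="evil()">Pears</a>'))
```

This only filters attribute names. It does not inspect the values, so
something like a javascript: URL in href would still get through and would
need a separate check.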
> I obviously don't expect someone to hand me all this
> on a plate - unless someone has already done something
> exactly this - but I am a beginner and find the
Actually, I have. Not exactly (I have code which converts URLs to
hyperlinks and changes img tags into links to the images), but it's not
hard to see how you could adapt it for your needs. If you're interested,
download the code from http://pears.sf.net. Look at the LinkMaker class in
the pearsengine.py file. It's pretty well documented.
> modules that I probably need rather baffling. Even
> reading the examples I found by searching the archives
> has left me thoroughly confused! I really need a shove
It really looks harder than it is :).
<snip>
> been a little harsh! In particular, I was surprised by
> the number of people who send HTML mail without a
> plain text alternative.)
May they all burn. :)
--
Yours,
Andrei
=====
Mail address in header catches spam. Real contact info (decode with rot13):
cebwrpg5 at jnanqbb.ay. Fcnz-serr! Cyrnfr qb abg hfr va choyvp cbfgf. V ernq
gur yvfg, fb gurer'f ab arrq gb PP.