[Tutor] [newbie] sanitizing HTML

Barnaby Scott barnabydscott at yahoo.com
Fri Nov 14 13:52:02 EST 2003

I am trying to write a script which will take some
HTML and go through it stripping out all tags that are
not expressly permitted in my script. The ones that I
will permit will generally be the basic harmless ones
like <p>, <br>, <hr>, <h1...>, <b> etc.

I also want to allow some of the more complex ones
(e.g. <a>, <img>, <body>, <table>) but limit their
attributes to a permitted subset.

Lastly, I want to rewrite any URIs that might be
present in an img src, a body background etc., so
that they can only point to files stored locally.
(Links will still be permitted, though.)
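
As a rough illustration of the three requirements above, here is a
minimal sketch using html.parser from the standard library of current
Python. The names ALLOWED, LOCAL_ONLY, DROP_CONTENT, is_local,
Sanitizer and sanitize are made up for this example, and the permitted
tags and attributes are assumptions rather than a vetted safe list.

```python
from html import escape
from html.parser import HTMLParser
from urllib.parse import urlsplit

# Hypothetical whitelist: tag name -> attributes permitted on that tag.
ALLOWED = {
    "p": set(), "br": set(), "hr": set(), "b": set(),
    "h1": set(), "h2": set(), "h3": set(),
    "a": {"href"},
    "img": {"src", "alt", "width", "height"},
    "body": {"background", "bgcolor"},
    "table": {"border", "cellpadding", "cellspacing"},
    "tr": set(), "td": set(),
}
# (tag, attribute) pairs whose URI must point to a local file (assumption).
LOCAL_ONLY = {("img", "src"), ("body", "background")}
# Tags that never get a closing tag in the output.
VOID = {"br", "hr", "img"}
# Tags whose text content is dropped along with the tag itself.
DROP_CONTENT = {"script", "style"}


def is_local(uri):
    """Treat a URI as local only if it has no scheme and no host part."""
    parts = urlsplit(uri)
    return not parts.scheme and not parts.netloc


class Sanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self._skip = 0  # depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self._skip += 1
            return
        if tag not in ALLOWED:
            return  # drop the tag but keep the text inside it
        kept = []
        for name, value in attrs:
            if name not in ALLOWED[tag] or value is None:
                continue  # attribute not in the permitted subset
            if (tag, name) in LOCAL_ONLY and not is_local(value):
                continue  # remote URI where only local files are allowed
            kept.append(' %s="%s"' % (name, escape(value, quote=True)))
        self.out.append("<%s%s>" % (tag, "".join(kept)))

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            if self._skip:
                self._skip -= 1
            return
        if tag in ALLOWED and tag not in VOID:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if not self._skip:
            self.out.append(escape(data, quote=False))


def sanitize(html_text):
    """Return html_text with only whitelisted tags and attributes kept."""
    parser = Sanitizer()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)
```

Calling sanitize('<img src="http://evil.example/x.gif" alt="x">')
keeps the tag but removes the remote src. In this sketch href values
are passed through unchanged, so a real filter would also want to
check the scheme of each link.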

I obviously don't expect someone to hand me all this
on a plate - unless someone has already done something
exactly like this - but I am a beginner and find the
modules that I probably need rather baffling. Even
reading the examples I found by searching the archives
has left me thoroughly confused! I really need a shove
in the right direction, and if possible some pointers
to lots of examples of the modules in action. 

(Just in case you're wondering, my reason for wanting
it is HTML email. At present I use a script that I
wrote to delete all HTML sections of incoming email on
the mail server. However I feel that this may have
been a little harsh! In particular, I was surprised by
the number of people who send HTML mail without a
plain text alternative.)
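
For the mail-server side, the check for "HTML without a plain text
alternative" could look something like the sketch below. It assumes
the standard email package and a message already read into a string;
the function name html_only is invented for this example.

```python
import email


def html_only(raw_message):
    """Return True if the message has a text/html part but no text/plain part."""
    msg = email.message_from_string(raw_message)
    # walk() visits the message itself and every nested MIME part.
    types = {part.get_content_type() for part in msg.walk()}
    return "text/html" in types and "text/plain" not in types
```

Messages for which this returns True are the ones where deleting the
HTML section would leave nothing readable, so they are the candidates
for sanitizing instead.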

