HTML filtering

Chris Liechti cliechti at gmx.net
Wed May 1 16:08:33 EDT 2002


"Stuart D. Gathman" <stuart at bmsi.com> wrote in
news:01Xz8.21629$YQ1.8012127 at typhoon.southeast.rr.com: 

> I need to filter HTML to remove certain constructs (e.g. <script ...>
> ... </script>).  I am trying to use the batteries.  The htmllib module
> helps with the parsing, but it seems like a lot of work to create a
> formatter that passes everything (except script) through in HTML
> syntax - especially trying to preserve original syntax.  Am I missing
> something?  Is there another module I should be using for filtering
> HTML?  Perhaps one of those ad stripping filters written in python
> would provide a usable example?

As you're not interested in the document structure etc., you could try a
plain text replacement approach, e.g. with re's:

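import re
# *? (non-greedy) is important; DOTALL lets "." span newlines; [^>]* allows attributes
script = re.compile(r"<script\b[^>]*>.*?</script\s*>", re.IGNORECASE | re.DOTALL)
filtered = script.sub("<!--script dropped-->", open(filename).read())
open(filename, "w").write(filtered)  # reopen in write mode to save the result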

or something like that.
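
A slightly fuller sketch of the same idea, wrapped in a function so it can be
tried on a string before touching any files. The names strip_scripts and
SCRIPT_RE and the sample input are made up for illustration, and a regex like
this is only a rough filter: it will not cope with every malformed page.

```python
import re

# Non-greedy .*? stops at the first closing tag; DOTALL lets "." match
# newlines; [^>]* allows attributes on the opening tag.
SCRIPT_RE = re.compile(r"<script\b[^>]*>.*?</script\s*>",
                       re.IGNORECASE | re.DOTALL)

def strip_scripts(html, placeholder="<!--script dropped-->"):
    """Return html with every <script>...</script> block replaced."""
    return SCRIPT_RE.sub(placeholder, html)

# Hypothetical sample input: mixed-case tag, an attribute, and newlines.
sample = '<p>a</p><SCRIPT type="text/javascript">\nalert(1)\n</script><p>b</p>'
print(strip_scripts(sample))  # prints <p>a</p><!--script dropped--><p>b</p>
```

Everything outside the script elements passes through byte for byte, so the
original syntax is preserved without writing a formatter.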

chris

-- 
Chris <cliechti at gmx.net>



