HTML filtering
Chris Liechti
cliechti at gmx.net
Wed May 1 16:08:33 EDT 2002
"Stuart D. Gathman" <stuart at bmsi.com> wrote in
news:01Xz8.21629$YQ1.8012127 at typhoon.southeast.rr.com:
> I need to filter HTML to remove certain constructs (e.g. <script ...>
> ... </script>). I am trying to use the batteries. The htmllib module
> helps with the parsing, but it seems like a lot of work to create a
> formatter that passes everything (except script) through in HTML
> syntax - especially trying to preserve original syntax. Am I missing
> something? Is there another module I should be using for filtering
> HTML? Perhaps one of those ad stripping filters written in python
> would provide a usable example?
as you're not interested in the document structure etc. you could try a text
replacement approach, e.g. with re's...
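import re
# *? (non-greedy) is important; [^>]* allows attributes; DOTALL lets . match newlines
script = re.compile(r"<script\b[^>]*>.*?</script\s*>", re.IGNORECASE | re.DOTALL)
filtered = script.sub("<!--script dropped-->", open(filename).read())
open(filename, "w").write(filtered)  # reopen in write mode to save the result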
or something like that.
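
a minimal sketch of the same idea as a function that works on a string, so it
can be tried without touching a file. the name strip_scripts and the sample
HTML are made up for illustration. it assumes script tags may carry attributes
and span several lines, and like any regex approach it can be fooled by a
literal "</script>" inside a string or comment in the script itself.

```python
import re

# Hypothetical helper, not part of any library.
# \b keeps tags such as <scripture> from matching, [^>]* allows attributes,
# DOTALL lets "." cross newlines, and *? stops at the first closing tag.
SCRIPT_RE = re.compile(r"<script\b[^>]*>.*?</script\s*>",
                       re.IGNORECASE | re.DOTALL)


def strip_scripts(html, marker="<!--script dropped-->"):
    """Return html with every <script ...>...</script> block replaced by marker."""
    return SCRIPT_RE.sub(marker, html)


sample = '<p>a</p><SCRIPT type="text/javascript">\nalert(1);\n</script><p>b</p>'
print(strip_scripts(sample))  # prints <p>a</p><!--script dropped--><p>b</p>
```

everything outside the matched blocks is passed through untouched, so the
original syntax of the rest of the page is preserved.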
chris
--
Chris <cliechti at gmx.net>