HTML parser example, anybody?

Ulf Goransson ug at algonet.se
Mon May 1 13:43:05 EDT 2000


haering at informatik.tu-muenchen.de wrote:
> 
> I'd rather transform HTML -> HTML. My current
> project uses special comments of the form <!--@INS keyword param1=value1
> param2=value2 -->. I'd like to parse for comments and filter out + process these. The
> prime reason for a real parser is that I need to rewrite all links in the
> HTML source tree and do some other checks.
> 

This sounds a bit like the script I use to maintain
my own home page. The pages can have pairs of comments like

<!--<HEADER TITLE="ug's other home page">-->
<!--</HEADER>-->

or

<!--<INDEX DIR='links' COMPACT OVERVIEW STRIP=".*: ">-->
<!--</INDEX>-->

or

<!--<FOOTER>-->
<!--</FOOTER>-->

The script then replaces anything between a pair
with whatever it has been programmed to generate. This way
I can automatically generate directory indexes, replace
the footer in every page in the HTML tree, generate
a table of contents for a page, etc.
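
The text inside each of those comments looks like an ordinary tag,
so a parser can pick it apart. The following is only a rough sketch
of that idea, not my actual code: sgmllib no longer exists in
Python 3, so it assumes html.parser.HTMLParser as a stand-in, and
the names DirectiveParser and parse_directive are made up for
illustration.

```python
from html.parser import HTMLParser


class DirectiveParser(HTMLParser):
    """Parse the text found inside a comment, e.g. <INDEX DIR='links' COMPACT>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.directives = []  # list of (kind, name, attrs) tuples

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names; attributes
        # without a value (COMPACT, OVERVIEW) come through as None.
        self.directives.append(("open", tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.directives.append(("close", tag, {}))


def parse_directive(comment_text):
    """Return the directives found in the body of one comment."""
    parser = DirectiveParser()
    parser.feed(comment_text)
    parser.close()
    return parser.directives
```

Calling parse_directive("<FOOTER>") gives [("open", "footer", {})],
and the INDEX example above yields a dict with the keys dir,
compact, overview and strip.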

At the heart of this is SgmlEcho (a subclass of
SGMLParser in sgmllib), which does very little: it just
echoes any input file to the output. I use this as a base
class in some CGI scripts too. It can be found at
http://cgi.algonet.se/htbin/cgiwrap/ug/show.py?script=sgmlecho.py
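
For anyone who can't reach that URL, here is a minimal sketch of
what an echoing parser can look like. It is not a copy of
sgmlecho.py: it assumes html.parser.HTMLParser (the Python 3
replacement for sgmllib), and the names HtmlEcho and echo are
invented for this example.

```python
from html.parser import HTMLParser


class HtmlEcho(HTMLParser):
    """Write every parsed construct back out, more or less unchanged."""

    def __init__(self):
        # Keep entity and character references as separate events
        # so they can be echoed verbatim.
        super().__init__(convert_charrefs=False)
        self.out = []

    def write(self, text):
        self.out.append(text)

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the tag exactly as it was written.
        self.write(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.write(self.get_starttag_text())

    def handle_endtag(self, tag):
        # The original spelling is not available here, so end tags
        # come out lowercased.
        self.write("</%s>" % tag)

    def handle_data(self, data):
        self.write(data)

    def handle_entityref(self, name):
        self.write("&%s;" % name)

    def handle_charref(self, name):
        self.write("&#%s;" % name)

    def handle_comment(self, data):
        self.write("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.write("<!%s>" % decl)

    def handle_pi(self, data):
        self.write("<?%s>" % data)

    def result(self):
        return "".join(self.out)


def echo(html):
    parser = HtmlEcho()
    parser.feed(html)
    parser.close()
    return parser.result()
```

A subclass then only needs to override the handlers it cares about
and can leave the rest of the page alone.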

Then there's HtmlTweaker (a subclass of SgmlEcho),
which overrides the handle_comment method. That is
the place to put the comment-parsing code.
(Maybe you can guess from my choice of syntax that
I've made yet another SGMLParser subclass called
CommentParser...) It also does some overly complicated
recursive stuff that even I am no longer sure I
understand... Mainly it's supposed to keep
track of dependencies and regenerate HTML files in the
optimal order.
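
The replace-between-markers part can be sketched roughly as below.
This is an illustration under several assumptions, not the real
HtmlTweaker: it uses html.parser.HTMLParser instead of sgmllib,
recognizes markers with a simple regular expression rather than a
separate CommentParser, and leaves out the dependency tracking
entirely. The generators mapping and the tweak function are
hypothetical names.

```python
import re
from html.parser import HTMLParser

# Matches comment bodies such as <FOOTER>, </FOOTER> or <INDEX DIR='links'>.
DIRECTIVE = re.compile(r"\s*<(/?)([A-Za-z]+)([^>]*)>\s*$")


class HtmlTweaker(HTMLParser):
    def __init__(self, generators):
        super().__init__(convert_charrefs=False)
        self.generators = generators  # directive name -> function(raw_attrs) -> str
        self.out = []
        self.skipping = None  # name of the open marker whose old content is dropped

    def write(self, text):
        if self.skipping is None:
            self.out.append(text)

    # Echo handlers, trimmed to the common cases.
    def handle_starttag(self, tag, attrs):
        self.write(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.write(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.write("</%s>" % tag)

    def handle_data(self, data):
        self.write(data)

    def handle_entityref(self, name):
        self.write("&%s;" % name)

    def handle_charref(self, name):
        self.write("&#%s;" % name)

    def handle_comment(self, data):
        match = DIRECTIVE.match(data)
        name = match.group(2).upper() if match else None
        if match and name in self.generators:
            closing = bool(match.group(1))
            if not closing and self.skipping is None:
                # Opening marker: keep it, emit fresh content, skip the old.
                self.out.append("<!--%s-->" % data)
                self.out.append(self.generators[name](match.group(3).strip()))
                self.skipping = name
                return
            if closing and self.skipping == name:
                # Closing marker: stop skipping and keep the marker.
                self.skipping = None
                self.out.append("<!--%s-->" % data)
                return
        self.write("<!--%s-->" % data)  # ordinary comment


def tweak(html, generators):
    parser = HtmlTweaker(generators)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Because the markers themselves are written back out, the same page
can be run through the script again and again.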

Additional HTML parsing is easily done by overriding
the unknown_starttag and unknown_endtag methods.
One example of this is
http://www.algonet.se/~ug/html+pycgi/scripts.html#RIPURL
It doesn't use SgmlEcho but the principle is the same.
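
Since you mentioned rewriting links, here is a sketch of how that
could look with the same principle. It is not what the RIPURL
script does, and it makes some assumptions: it uses
html.parser.HTMLParser, where handle_starttag plays the role that
unknown_starttag has in sgmllib, it only looks at href and src
attributes, and it rebuilds each start tag, so attribute quoting
may differ from the input. LinkRewriter and rewrite_links are
made-up names.

```python
from html import escape
from html.parser import HTMLParser

LINK_ATTRS = {"href", "src"}  # assumption: the only attributes holding links


class LinkRewriter(HTMLParser):
    def __init__(self, rewrite):
        super().__init__(convert_charrefs=False)
        self.rewrite = rewrite  # function mapping an old URL to a new one
        self.out = []

    def _tag(self, tag, attrs, close=">"):
        # Rebuild the start tag, passing link attributes through rewrite().
        parts = [tag]
        for name, value in attrs:
            if value is None:  # attribute without a value
                parts.append(name)
                continue
            if name in LINK_ATTRS:
                value = self.rewrite(value)
            parts.append('%s="%s"' % (name, escape(value, quote=True)))
        self.out.append("<" + " ".join(parts) + close)

    def handle_starttag(self, tag, attrs):
        self._tag(tag, attrs)

    def handle_startendtag(self, tag, attrs):
        self._tag(tag, attrs, " />")

    # Everything else is echoed unchanged.
    def handle_endtag(self, tag):
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)


def rewrite_links(html, rewrite):
    parser = LinkRewriter(rewrite)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

For example, rewrite_links(page, lambda u: u.replace("/old/", "/new/"))
moves every href and src from /old/ to /new/ and leaves the rest of
the page as it was.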

Hope this helps some...

/ug


