htmllib and parsing data surrounded by HTML tags

Milos Prudek milos.prudek at tiscali.cz
Sat Jun 8 08:03:17 EDT 2002


I studied htmllib hard and I was able to subclass it and do some
processing of HTML tags. I managed to extend it to process <TD>, for
instance.

But what about the text that is surrounded by HTML tags? Can sgmllib,
htmllib and formatter be leveraged to find data surrounded by HTML tags
and process it?

An example HTML:
<TABLE>
<TR>
<TD>First Name:</TD><TD>Peter</TD>
</TR>
<TR>
<TD>Last Name:</TD><TD>Smith</TD>
</TR>
</TABLE>

The task is to locate the second row in this table and pull the string
from the second column of that row, i.e. "Smith".

Naturally this can be done without htmllib, but for the sake of a
systematic approach I'm interested in whether it can be done with
htmllib. It seems that htmllib walks over HTML tags but leaves
everything that is not an HTML tag unprocessed and writes it out "as
is", and there are no hooks to attach to. It also seems that sgmllib
would need "row counter" and "column counter" attributes, but I do not
feel I understand sgmllib well enough to subclass it...
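
To make the counter idea concrete, here is a minimal sketch. It assumes
Python 3's html.parser module rather than htmllib/sgmllib (which are not
available in Python 3), and it assumes each name/value pair sits in its
own table row. The class name CellGrabber and its cells dictionary are
made up for illustration. The handle_data method is the hook that
receives the text between tags; sgmllib's parser has a method of the
same name, so the same shape may carry over there.

```python
from html.parser import HTMLParser


class CellGrabber(HTMLParser):
    """Hypothetical helper: collect the text of each <td>, keyed by (row, col)."""

    def __init__(self):
        super().__init__()
        self.row = 0          # 1-based index of the current <tr>
        self.col = 0          # 1-based index of the current <td> in that row
        self.in_cell = False  # True while between <td> and </td>
        self.cells = {}       # maps (row, col) to the cell text

    def handle_starttag(self, tag, attrs):
        # html.parser passes tag names in lower case.
        if tag == "tr":
            self.row += 1
            self.col = 0
        elif tag == "td":
            self.col += 1
            self.in_cell = True
            self.cells[(self.row, self.col)] = ""

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        # Text between tags arrives here; keep it only while inside a cell.
        if self.in_cell:
            self.cells[(self.row, self.col)] += data


# Sample input, assumed to have one name/value pair per row.
SAMPLE = """
<TABLE>
<TR>
<TD>First Name:</TD><TD>Peter</TD>
</TR>
<TR>
<TD>Last Name:</TD><TD>Smith</TD>
</TR>
</TABLE>
"""

parser = CellGrabber()
parser.feed(SAMPLE)
parser.close()

# Second row, second column.
print(parser.cells[(2, 2)])
```

Keeping every cell in a dictionary, instead of stopping at the first
match, means the same parser can be reused to look up any other
(row, column) position.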

Should I forget about htmllib because it is unsuitable for this?

-- 
Milos Prudek
