Looking for code which allows easy extraction of text from HTML

Grzegorz Adam Hankiewicz gradha at titanium.sabren.com
Wed Mar 5 22:31:36 CET 2003


On Wed, Mar 05, 2003 at 05:55:59PM +0000, Joe Francia wrote:
> Use the SGMLParser in sgmllib, as it's slightly easier to use.
> Define a start_<tagname> method for each <tagname> you will parse,
> and handle_data(self, data) is called for all text between tags.

Thanks for the advice, but that doesn't work for me: the data I want
to extract is embedded in several levels of nested tables and their
rows/cells, and I want to pull it out at different levels and treat
each level differently. I did start on a stateful parser, but the
number of conditions and variables needed to keep track of where the
data I wanted was became overwhelming.

Another idea I had was to parse the start/end tags and push them onto
a stack, so I could use a regex-like function that would be triggered
only when a certain sequence of <table><tr><td>... was matched. But
that looks tedious, like the XML solution.
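
A minimal sketch of that stack idea, for illustration only. It assumes
Python 3, where sgmllib is no longer available, so it swaps in
html.parser.HTMLParser from the standard library. The names
PathTextExtractor and VOID_TAGS are made up for this example, and the
"regex-like" match is simplified to a plain suffix comparison on the
stack of open tags. The table nesting depth is recorded with each piece
of text so that different levels can be treated differently.

```python
from html.parser import HTMLParser

# Tags that never get a closing tag; pushing them would corrupt the stack.
# (Partial list, enough for this sketch.)
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class PathTextExtractor(HTMLParser):
    """Collect text whose enclosing open tags end with a given sequence."""

    def __init__(self, wanted_suffix):
        super().__init__()
        self.wanted = list(wanted_suffix)  # e.g. ["table", "tr", "td"]
        self.stack = []    # currently open tags, outermost first
        self.matches = []  # (table nesting depth, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag, which also
        # discards any tags left unclosed inside it.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack[-len(self.wanted):] == self.wanted:
            depth = self.stack.count("table")
            self.matches.append((depth, text))


html = (
    "<table><tr><td>outer"
    "<table><tr><td>inner</td></tr></table>"
    "</td></tr></table>"
)
parser = PathTextExtractor(["table", "tr", "td"])
parser.feed(html)
parser.close()
for depth, text in parser.matches:
    print(depth, text)
```

The only state kept is the stack itself, so there are no per-level flags
to maintain. A stricter pattern (say, only cells of the second nested
table) would need a richer comparison than the suffix check used here.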

PS: Please don't send me private copies of your public answers.




