parsing complex web pages
Skip Montanaro
skip at pobox.com
Wed Jun 18 21:20:06 EDT 2003
John> I wrote some code to parse HTML tables and forms into a
John> specialised object model useful for web testing and scraping (the
John> tables code is 'very alpha').
I have something similar which I use quite successfully in the Musi-Cal Gig
Gopher (a concert calendar scraper). The Gig Gopher runs a pipeline of
commands to massage a page into a form more readily amenable to pattern
matching using regular expressions. The table filter does nothing more than
make the format more uniform. If the input is
<table WIDTH="95%" >
<caption><tbody>
<br></tbody></caption>
<tr>
<td BGCOLOR="#DDDDEE">Saturday, May 3 @ 9 PM </td>
<td BGCOLOR="#DDDDEE"><b>Jennifer Greer, Lise Winne, Karen Jacobsen, Melanie
Krahmer, and Corley Roberts</b></td>
<td BGCOLOR="#DDDDEE">$10/8</td>
</tr>
...
the output is
<td>Saturday, May 3 @ 9 PM<td>Jennifer Greer, Lise Winne, Karen Jacobsen, Melanie Krahmer, and Corley Roberts<td>$10/8
...
that is, one table row per output line, only <td> tags as field introducers,
and all attributes stripped. I also have a table expander which expands
cells with colspan and rowspan attributes, giving a uniformly rectangular
table.
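For illustration, here is a minimal sketch of a table filter along these lines. It is not the Gig Gopher's actual code: the names TableFilter and filter_tables are hypothetical, it uses the standard library's html.parser from current Python, and it assumes tables are not nested.

```python
from html.parser import HTMLParser


class TableFilter(HTMLParser):
    """Collect one line per table row, with bare <td> tags introducing cells."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.lines = []    # finished rows, one string each
        self.cells = None  # cells of the row being read, None outside a row
        self.text = None   # text fragments of the current cell, None outside a cell

    def _end_cell(self):
        if self.text is not None:
            # Collapse runs of whitespace (including newlines) to single spaces.
            self.cells.append(" ".join("".join(self.text).split()))
            self.text = None

    def _end_row(self):
        self._end_cell()
        if self.cells:
            self.lines.append("".join("<td>" + cell for cell in self.cells))
        self.cells = None

    def handle_starttag(self, tag, attrs):
        # Attributes are ignored entirely, which is how they get stripped.
        if tag == "tr":
            self._end_row()   # tolerate a missing </tr>
            self.cells = []
        elif tag in ("td", "th") and self.cells is not None:
            self._end_cell()  # tolerate a missing </td>
            self.text = []

    def handle_endtag(self, tag):
        if tag in ("tr", "table"):
            self._end_row()
        elif tag in ("td", "th"):
            self._end_cell()

    def handle_data(self, data):
        # Text outside a cell (captions, stray whitespace) is dropped.
        if self.text is not None:
            self.text.append(data)


def filter_tables(html):
    """Return the table rows found in html, one normalized row per line."""
    parser = TableFilter()
    parser.feed(html)
    parser.close()
    parser._end_row()  # flush a row left open at end of input
    return "\n".join(parser.lines)
```

Inline markup inside a cell, such as <b>, is dropped and only the text is kept. Character references are decoded and not re-escaped, so the result is meant as input for regular expressions and not as valid HTML.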
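The table expander can be sketched in a similar way. This is an assumption-laden illustration, not the original code: the function name expand_table and its input format (rows of (text, rowspan, colspan) tuples) are hypothetical, and it assumes a spanning cell's text should be repeated in every slot it covers, with any slot no cell covers filled with an empty string.

```python
def expand_table(rows):
    """Expand spanning cells into a rectangular grid of strings.

    rows is a list of rows; each row is a list of (text, rowspan, colspan)
    tuples in document order. Returns a list of equal-length lists.
    """
    grid = {}  # (row, col) -> text
    for r, row in enumerate(rows):
        c = 0
        for text, rowspan, colspan in row:
            # Skip slots already claimed by a rowspan from an earlier row.
            while (r, c) in grid:
                c += 1
            # Copy the cell's text into every slot it spans.
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text
            c += colspan
    if not grid:
        return []
    nrows = max(r for r, _ in grid) + 1
    ncols = max(c for _, c in grid) + 1
    # Slots that no cell covers become empty strings.
    return [[grid.get((r, c), "") for c in range(ncols)] for r in range(nrows)]


if __name__ == "__main__":
    table = [
        [("Date", 1, 1), ("Acts", 1, 2)],
        [("May 3", 2, 1), ("A", 1, 1), ("$10", 1, 1)],
        [("B", 1, 1), ("$8", 1, 1)],
    ]
    for line in expand_table(table):
        print(line)
```

Keeping the grid in a dictionary keyed by (row, col) makes it easy for a rowspan to reserve slots in rows that have not been read yet.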
Skip