parsing complex web pages

Wed Jun 18 21:20:06 EDT 2003

    John> I wrote some code to parse HTML tables and forms into a
    John> specialised object model useful for web testing and scraping (the
    John> tables code is 'very alpha').

I have something similar which I use quite successfully in the Musi-Cal Gig
Gopher (a concert calendar scraper).  The Gig Gopher runs a pipeline of
commands to massage a page into a form more readily amenable to pattern
matching using regular expressions.  The table filter does nothing more than
make the format more uniform.  If the input is

    <table WIDTH="95%" >
    <caption><tbody>
    <br></tbody></caption>

    <tr>
    <td BGCOLOR="#DDDDEE">Saturday, May 3   @  9 PM </td>

    <td BGCOLOR="#DDDDEE"><b>Jennifer Greer, Lise Winne, Karen Jacobsen, Melanie
    Krahmer, and Corley Roberts</b></td>

    <td BGCOLOR="#DDDDEE">$10/8</td>
    </tr>
    ...

the output is

    <td>Saturday, May 3 @ 9 PM<td>Jennifer Greer, Lise Winne, Karen Jacobsen, Melanie Krahmer, and Corley Roberts<td>$10/8
    ...

that is, one table row per output line, with only <td> tags as field
introducers, all attributes and inline markup (such as <b>) stripped, and
runs of whitespace collapsed.  I also have a table expander which expands
cells with colspan and rowspan attributes, giving a uniformly rectangular
table.
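
For anyone who wants to try the idea, here is a minimal sketch of such a
table filter.  It is not the Gig Gopher's actual code: it assumes Python's
standard html.parser module, and the names TableFlattener and flatten_tables
are made up for illustration.  It ignores nested tables.

```python
import re
from html.parser import HTMLParser


class TableFlattener(HTMLParser):
    """Collect table rows as lists of whitespace-normalized cell texts."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text fragments of the cell being read, or None

    def _end_cell(self):
        # Close the current cell (if any), collapsing runs of whitespace.
        if self._cell is not None:
            text = re.sub(r"\s+", " ", "".join(self._cell)).strip()
            self._row.append(text)
            self._cell = None

    def _end_row(self):
        # Close the current row (if any); rows without cells are dropped.
        self._end_cell()
        if self._row:
            self.rows.append(self._row)
        self._row = None

    def handle_starttag(self, tag, attrs):
        # Attributes are simply ignored, which is what strips them.
        if tag == "tr":
            self._end_row()
            self._row = []
        elif tag in ("td", "th"):
            if self._row is None:   # tolerate a cell with no enclosing <tr>
                self._row = []
            self._end_cell()        # tolerate a missing </td>
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._end_cell()
        elif tag in ("tr", "table"):
            self._end_row()

    def handle_data(self, data):
        # Text outside a cell (captions, stray newlines) is discarded.
        if self._cell is not None:
            self._cell.append(data)


def flatten_tables(html):
    """Return one line per table row, each cell introduced by a bare <td>."""
    parser = TableFlattener()
    parser.feed(html)
    parser.close()
    return "\n".join("".join("<td>" + c for c in row) for row in parser.rows)
```

Feeding it the input fragment above (with the row closed off) should give
the single output line shown, which is then easy to pick apart with a
regular expression or a plain split on "<td>".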
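
The expander can be sketched along the same lines.  Again this is only an
illustration under stated assumptions, not the real thing: the function name
expand_spans is hypothetical, each cell is assumed to have already been
parsed into a (text, colspan, rowspan) tuple, and a spanning cell's text is
copied into every slot it covers (leaving the extra slots empty would be an
equally reasonable choice).

```python
def expand_spans(rows):
    """Expand colspan/rowspan cells into a rectangular grid of strings.

    rows is a list of table rows; each row is a list of
    (text, colspan, rowspan) tuples in document order.
    """
    grid = {}   # (row index, column index) -> cell text
    for r, row in enumerate(rows):
        c = 0
        for text, colspan, rowspan in row:
            # Skip slots already claimed by a rowspan from an earlier row.
            while (r, c) in grid:
                c += 1
            # Copy the text into every slot this cell spans.
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text
            c += colspan
    if not grid:
        return []
    height = max(r for r, _ in grid) + 1
    width = max(c for _, c in grid) + 1
    # Slots that no cell covered are filled with empty strings.
    return [[grid.get((r, c), "") for c in range(width)]
            for r in range(height)]
```

For example, a first row of [("Date", 1, 2), ("Acts", 2, 1)] followed by
[("Opener", 1, 1), ("Headliner", 1, 1)] comes out as two rows of three
cells each, with "Date" repeated down the first column and "Acts" repeated
across the top.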

Skip