[Tutor] Help with Parsing HTML files

Charlie Clark Charlie Clark <charlie@begeistert.org>
Thu, 02 Aug 2001 19:19:53 +0200


As part of a prototype I need to be able to plug in several different content 
websites and pull headlines to put on my own website through the medium of a 
database. I know that the normal way of doing this is subscribing to some 
XML-based format, but that isn't possible at the moment as the streams would 
be too expensive (around Euro 5000 per stream per month). We have a couple of 
Visual Basic scripts doing this at the moment, but I have suggested moving 
away from them: the scripts are not easily reusable or extensible, Python 
would give us platform independence, and moving from VB + MS SQL to Python + 
PostgreSQL or similar has a certain commercial logic.

Scenario: a web page providing content (x articles on the page, all in the 
same format); there are no handy comment tags in the source differentiating 
the various parts of interest.

What's the best way to go about parsing the HTML? I've looked at sgmllib and 
htmllib and am a bit lost. The worst thing for me about Python's 
documentation is its lack of examples. I leafed through all the Python books 
in the bookshop today but failed to find much inspiration. One of the 
problems I'll admit to having is not being able to work out how to use a 
class simply by reading its code - it just doesn't work for me :-((

I see the following alternatives:

1) extend and improve on treating the source as plain text; making use of 
regular expressions might be useful here (I take a rough stab at this after 
the example source below).
2) use a library module to parse the HTML source and get it to release the 
appropriate objects (along the lines of the sgmllib sketch above).

I'd really like to be able to have a system which could easily be trained to 
deal with new source formats on a kind of template basis.

Here's a made-up example source:

<body>
....
<table>
<tr>
<td><img>Date</td>
<td><font>title</font><br><br>Article</td>
</tr>
.... continues with the rest of the articles
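
Along the lines of alternative 1, this is the kind of thing I have in mind - 
untested, and the pattern and the group names (date, title, body) are just 
guesses against the made-up source above:

import re

# one pattern per article row; the literal tags are copied from the
# example source above, the (?P<...>) groups mark the parts to keep
article_pat = re.compile(
    r'<td><img[^>]*>(?P<date>.*?)</td>\s*'
    r'<td><font[^>]*>(?P<title>.*?)</font><br><br>(?P<body>.*?)</td>',
    re.DOTALL)

source = open('page.html').read()
for date, title, body in article_pat.findall(source):
    print date, title

It feels brittle, though: one small change to the page layout and the 
pattern silently stops matching, which is why alternative 2 appeals.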

I'm currently analysing the source and working out ways to separate the 
articles from each other and then read the individual articles. As I'm 
having to read the source into a single string, I can see sgmllib and 
htmllib calling, I just don't know what they are saying to me, so at the 
moment it's a question of something like:
markers = ['<td><img>', '</tr>']   # made-up markers bracketing one article
start = source.find(markers[0])
while start != -1:                 # loop while the source still contains articles
    stop = source.find(markers[-1], start + len(markers[0]))
    if stop == -1:                 # end marker missing: stop rather than loop forever
        break
    article = source[start:stop]
    do_something_with_article(article, markers=markers)  # pulls out the
                                   # contents and writes them into the database
    source = source[stop:]         # chop off what has been handled
    start = source.find(markers[0])

Would it be possible to take this source and mark it up as a template which 
would in turn generate markers for automated parsing? So 
<td><img>Date</td> 
would become 
<!-- date_start --><td><img>Date</td><!-- date_end -->
Then a program could learn how to parse a new page based on the template and 
happily go about doing it. This would separate templating from programming 
and be useful in itself.
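
Roughly what I imagine, as an untested sketch (the comment format and 
template_to_regex are my own invention):

import re

# a marked field looks like <!-- date_start -->sample<!-- date_end -->
field_pat = re.compile(r'<!-- (\w+)_start -->.*?<!-- \1_end -->', re.DOTALL)

def template_to_regex(template):
    # turn a marked-up template into a compiled pattern: the literal
    # text is escaped verbatim, each marked field becomes a named
    # group that captures the corresponding span of a real page
    pattern = ''
    pos = 0
    while 1:
        match = field_pat.search(template, pos)
        if match is None:
            break
        pattern = pattern + re.escape(template[pos:match.start()])
        pattern = pattern + '(?P<%s>.*?)' % match.group(1)
        pos = match.end()
    pattern = pattern + re.escape(template[pos:])
    return re.compile(pattern, re.DOTALL)

page_pat = template_to_regex(open('template.html').read())
match = page_pat.search(open('page.html').read())
if match:
    print match.group('date')

The named group would capture the whole marked span, tags and all, so a 
second pass would still have to strip the markup - but the person doing the 
templates would never have to touch the program.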

Would this be a good idea? How would I go about doing this "properly" using 
the modules?

Many thanx for any help and pointers.

Charlie
-- 
Charlie Clark
Helmholtzstr. 20
Düsseldorf
D- 40215
Tel: +49-211-938-5360
GSM: +49-178-463-6199
http://www.begeistert.org