[Tutor] Help with Parsing HTML files
Charlie Clark
Charlie Clark <charlie@begeistert.org>
Thu, 02 Aug 2001 19:19:53 +0200
As part of a prototype I need to be able to plug in several different content
websites and pull headlines to put them on my own website through the medium
of a database. I know that the normal way of doing this is subscribing to
some XML-based format, but that isn't possible at the moment as the streams would
be too expensive (around Euro 5000 per stream per month). We have a couple of
Visual Basic scripts doing this at the moment, but I have suggested moving
away from them: the scripts are not easily reusable or extensible; Python would give us
platform independence and moving from VB + MS SQL to Python + PostgreSQL or
similar has a certain commercial logic.
Scenario: a web page providing content (x articles on the page, all in the
same format); there are no handy comment tags in the source differentiating
the various parts of interest.
What's the best way to go about parsing the HTML? I've looked at sgmllib and
htmllib and am a bit lost. The worst thing for me about Python's
documentation is its lack of examples. I leafed through all the Python books
in the bookshop today but failed to find much inspiration. One of the
problems I'll admit to having is not being able to work out how to use a
class simply by reading its code - it just doesn't work for me :-((
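For what it's worth, the furthest I've got with sgmllib is something like the
following untested sketch, which just collects the plain text of every table
cell (all the names here are my own, and "page.html" is a stand-in for the
real source):

    import sgmllib

    class CellGrabber(sgmllib.SGMLParser):
        """Collect the plain text of every <td>...</td> cell."""

        def __init__(self):
            sgmllib.SGMLParser.__init__(self)
            self.cells = []     # text of each completed cell
            self.buffer = None  # None whenever we're outside a cell

        def start_td(self, attrs):    # called for every <td>
            self.buffer = []

        def end_td(self):             # called for every </td>
            self.cells.append("".join(self.buffer))
            self.buffer = None

        def handle_data(self, text):  # text between tags
            if self.buffer is not None:
                self.buffer.append(text)

    parser = CellGrabber()
    parser.feed(open("page.html").read())
    parser.close()
    # parser.cells now holds one string per <td>, e.g. the date cell
    # and the title/article cell of each row in turn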
I see the following alternatives:
1) extend and improve on treating the source as plain text; regular
expressions might be useful here (see the sketch just below).
2) use a library module to parse the HTML source and get it to hand back the
appropriate objects (roughly what the sgmllib sketch above does).
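For (1), a minimal sketch of what I mean, with a pattern invented to fit the
made-up example source further down (a real source would need its own
pattern, and the field names are mine):

    import re

    # Pattern fitted to the example <tr> below: a date cell, then a
    # cell with a <font> title and the article text after two <br>s.
    article_pat = re.compile(
        r"<td><img[^>]*>(?P<date>.*?)</td>\s*"
        r"<td><font[^>]*>(?P<title>.*?)</font><br><br>(?P<body>.*?)</td>",
        re.DOTALL | re.IGNORECASE)

    page = open("page.html").read()
    for date, title, body in article_pat.findall(page):
        print date, title    # one tuple per article in the page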
I'd really like to be able to have a system which could easily be trained to
deal with new source formats on a kind of template basis.
Here's a made up example source
<body>
....
<table>
  <tr>
    <td><img>Date</td>
    <td><font>title</font><br><br>Article</td>
  </tr>
  .... continues with the rest of the articles
I'm currently analysing the source and working out ways to separate articles
from each other and then read individual articles. As I'm having to read the
source into a single string, I can hear sgmllib and htmllib calling; I just
don't know what they are saying to me. So at the moment it's a question of:
    # 'page' holds the whole source read in as a single string
    markers = []    # the marker strings for one source, first to last
    while 1:        # while the page still contains articles
        start = page.find(markers[0])
        if start < 0:
            break   # no more articles left
        stop = page.find(markers[-1], start) + len(markers[-1])
        article = page[start:stop]
        do_something_with_article(article, markers=markers)  # pulls out the
                        # contents and writes them into the database
        page = page[stop:]
Would it be possible to take this source and mark it up as a template which
would in turn generate markers for automated parsing? So
<td><img>Date</td>
would become
<!-- date_start --><td><img>Date</td><!-- date_end -->
Then a program could learn how to parse a new page based on the template and
happily go about doing it. This would separate templating from programming
and be useful in itself.
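Something like the following sketch is what I have in mind for the "learning"
step (the comment convention, the function names and the assumption that the
sample text inside each pair of comments is just the marker's name, like
"Date", are all my own invention):

    import re

    # An annotated fragment looks like:
    #   <!-- date_start --><td><img>Date</td><!-- date_end -->
    marker_pat = re.compile(r"<!-- (\w+)_start -->(.*?)<!-- \1_end -->",
                            re.DOTALL)

    def learn_markers(template):
        """Split each annotated fragment around its sample text
        ('Date' inside the date markers) into a literal
        (prefix, suffix) pair."""
        markers = {}
        for name, fragment in marker_pat.findall(template):
            sample = re.compile(re.escape(name), re.IGNORECASE)
            prefix, suffix = sample.split(fragment, 1)
            markers[name] = (prefix, suffix)
        return markers

    def extract(page, prefix, suffix, pos=0):
        """Return (value, position after the match) for the first
        prefix...suffix pair at or after pos, or (None, pos)."""
        start = page.find(prefix, pos)
        if start < 0:
            return None, pos
        start = start + len(prefix)
        stop = page.find(suffix, start)
        return page[start:stop], stop + len(suffix)

For the date example above, learn_markers() would give the pair
('<td><img>', '</td>'), and a loop over extract() could then walk the live
page pulling out each date in turn, without the parsing code knowing anything
about the particular source.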
Would this be a good idea? How would I go about doing this "properly" using
the modules?
Many thanx for any help and pointers.
Charlie
--
Charlie Clark
Helmholtzstr. 20
Düsseldorf
D-40215
Tel: +49-211-938-5360
GSM: +49-178-463-6199
http://www.begeistert.org