Extracting data from HTML

Paul Boddie paul at boddie.net
Wed Jun 5 07:21:37 EDT 2002

lailian98 at hotmail.com (Hazel) wrote in message news:<82096df4.0206010201.736c4b2c at posting.google.com>...
> I'm relying on sgmllib to do the work....
> since htmllib requires heavy coding.

First of all, take a look at this document:


The first section describes the use of sgmllib, although you could
also try various XML package classes instead. If you understand the
document, you barely need to read the rest of this message.

> Here, an instance of what I want to extract....
> the time of the TV programme >> 12:15:00AM
> "<TR>
>    <TD align=right bgColor=#000033><FONT color=#ffffff 
>    face="verdana, arial, helvetica" size=1>12:15:00 
>    AM</FONT></TD>"
> So what do u think?

The most important structural detail is probably the 'TD' element,
even though the data you need is found within the 'FONT' element. What
you could do, therefore, is to set up a handler method for the 'TD'
element called 'start_td' which sets a flag in your parser object
noting that the information inside the element may be interesting; you
could do more checks on the attributes ('align', 'bgColor') if you
believe that they help to distinguish the cells in the table which
contain times from the other cells. You also need an 'end_td' method
which unsets the flag, and you should always beware of nested tables,
since a single flag cannot tell an outer cell from an inner one.

  def start_td(self, attributes):
      if <some test with the attributes>:
          self.inside_cell = 1

  def end_td(self):
      self.inside_cell = 0
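
One way to guard against nested tables is to keep a stack of open
cells rather than a single flag. What follows is only a sketch under
some assumptions: it uses html.parser.HTMLParser from current Python
(sgmllib is not available in Python 3), so the handlers are
'handle_starttag' and 'handle_endtag' rather than 'start_td' and
'end_td'. The names 'CellTracker', 'cell_stack' and
'looks_like_time_cell', and the attribute test itself, are made up for
illustration.

```python
from html.parser import HTMLParser


class CellTracker(HTMLParser):
    """Remembers, for every open 'TD', whether it looked interesting."""

    def __init__(self):
        super().__init__()
        # One entry per open TD: True if that cell passed the attribute test.
        self.cell_stack = []

    def looks_like_time_cell(self, attrs):
        # Assumed test: the sample cell is right-aligned.
        # html.parser lowercases attribute names ('bgColor' -> 'bgcolor').
        return dict(attrs).get("align") == "right"

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.cell_stack.append(self.looks_like_time_cell(attrs))

    def handle_endtag(self, tag):
        if tag == "td" and self.cell_stack:
            self.cell_stack.pop()

    @property
    def inside_cell(self):
        # True only when the innermost open cell is one of interest.
        return bool(self.cell_stack) and self.cell_stack[-1]
```

With this arrangement, closing an inner cell restores the state of the
outer cell instead of clearing it.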

Once "inside" the 'TD' element, you might then want to check for a
'FONT' element. Again, set up a handler method called 'start_font'
which firstly checks to see if that flag was set, indicating that the
parser is currently "inside" the 'TD' element of interest. Then, you
might want to check for some interesting attributes, but only if you
think you can rely on them - I get suspicious about multiple
presentational attributes (especially when they could have used a
stylesheet), and that's partly why I advocate checking for the
presence of more than one element type (in this case, the 'TD' and the
'FONT' elements) before mining away at the data.

The 'start_font' method will also set a flag indicating that a time
should be extracted from the text inside the element (between the
start and end tags), and again, it's important to implement an
'end_font' method which unsets this new flag.

  def start_font(self, attributes):
      if self.inside_cell and <some test with the attributes>:
          self.ready_to_read = 1

  def end_font(self):
      if self.inside_cell: # Arguably not necessary.
          self.ready_to_read = 0

Finally, you should implement the 'handle_data' method which checks
that this new flag is set before reading the textual data and storing
it somewhere (such as another attribute in your parser class).

  def handle_data(self, data):
      if self.ready_to_read:
          # 'times' is a hypothetical list attribute created in __init__.
          self.times.append(data)
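
Put together, the flag-based approach might look like the sketch below.
It is an illustration rather than a tested recipe: it is written
against html.parser.HTMLParser from current Python instead of sgmllib,
so the per-element methods become branches inside 'handle_starttag' and
'handle_endtag'. The class name 'TimeExtractor' and the attributes
'chunks' and 'times' are invented for the example, and the attribute
tests are left out for brevity.

```python
from html.parser import HTMLParser


class TimeExtractor(HTMLParser):
    """Collects the text of FONT elements found inside TD elements."""

    def __init__(self):
        super().__init__()
        self.inside_cell = False
        self.ready_to_read = False
        self.chunks = []  # text fragments of the FONT element being read
        self.times = []   # hypothetical attribute holding the results

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.inside_cell = True
        elif tag == "font" and self.inside_cell:
            self.ready_to_read = True
            self.chunks = []

    def handle_endtag(self, tag):
        if tag == "font" and self.ready_to_read:
            self.ready_to_read = False
            # The sample splits "12:15:00" and "AM" across two lines, so
            # collapse every run of whitespace into a single space.
            self.times.append(" ".join("".join(self.chunks).split()))
        elif tag == "td":
            self.inside_cell = False
            self.ready_to_read = False

    def handle_data(self, data):
        # The text may arrive in several pieces, so accumulate it.
        if self.ready_to_read:
            self.chunks.append(data)


sample = """<TR>
   <TD align=right bgColor=#000033><FONT color=#ffffff
   face="verdana, arial, helvetica" size=1>12:15:00
   AM</FONT></TD>"""

parser = TimeExtractor()
parser.feed(sample)
parser.close()
print(parser.times)
```

Accumulating the fragments and joining them when the 'FONT' element
ends avoids relying on the text being delivered in a single call.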

There are lots of issues with doing the parsing this way, and having
parsed some pretty complicated pages, I can certainly recommend the
XML approach instead, since it provides much better ways of testing
the structure than setting flags here and there. Unfortunately, you
may well need something like mxTidy to deal with severely broken HTML,
of which there seems to be a lot around.
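
To give a rough idea of what testing the structure looks like on the
XML side, here is a small sketch. It assumes the page has already been
tidied into well-formed XML; the tidied markup below was written by
hand for the example and is only a guess at what a tidying tool would
produce. It uses xml.etree.ElementTree from the current standard
library, which is one possible choice among the XML packages.

```python
import xml.etree.ElementTree as ET

# Hand-tidied version of the sample row: attributes quoted, tags lowercased.
# This is an assumption about what a tidying tool would produce.
tidied = """<table>
  <tr>
    <td align="right" bgcolor="#000033"><font color="#ffffff"
      face="verdana, arial, helvetica" size="1">12:15:00
      AM</font></td>
  </tr>
</table>"""

root = ET.fromstring(tidied)

# The path expression states the required structure directly:
# a 'font' element that is a child of a right-aligned 'td'.
times = [
    " ".join("".join(font.itertext()).split())
    for font in root.findall(".//td[@align='right']/font")
]
print(times)
```

The structural test lives in one path expression, so no flags need to
be set or unset along the way.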

