HTMLParser problems.

Peter Otten __peter__ at web.de
Fri Oct 31 14:30:52 EST 2003


Sean Cody wrote:

> I'm trying to take a webpage that has an nxn table of entries (bus times)
> and convert it to a 2D array (list of lists).  Initially this was simple
> but I need to be able to access whole 'columns' of data so the 2D array
> cannot be sparse but in the HTML file I'm parsing there can be sparse
> entries which are represented in the table as &nbsp entities.  The sparse
> output breaks my ability to use entire columns and have entries correspond
> properly.
> 
> Is there a simple way to tell the parser whenever you see a &nbsp in table
> data return say... "-1" or "NaN"?
> The HTMLParser documentation is a bit.... terse.  I was considering using
> the handle_entityref() method but I would assume the data has already been
> parsed at that point.
> 
> I could try:
>         def handle_entityref(self,entity):
>                 if self.in_td == 1:
>                     if entity == "nbsp":
>                         self.row.append(-1)
> 
> But that seems ugly... (comments?).
> 
> As an example here is some code I'm using and partial output:

[...]

> parser.feed(socket.read())

The simplest solution is to replace the above line with

parser.feed(socket.read().replace("&nbsp;", "NaN"))
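
To see what the one-line replacement buys you, here is a minimal sketch in
current Python 3, where the module is html.parser rather than HTMLParser.
CellCollector, cells_of() and the sample row are made up for illustration and
are not from the original script. convert_charrefs=False is passed so that
entities are not delivered to handle_data(), which is an assumption chosen to
mirror the behaviour discussed in this thread.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Hypothetical helper: collects the text of every <td> into a flat list."""

    def __init__(self):
        # With convert_charrefs=False an entity such as &nbsp; is routed to
        # handle_entityref() (a no-op here) instead of handle_data().
        super().__init__(convert_charrefs=False)
        self.cells = []
        self.in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_td = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_td = False

    def handle_data(self, data):
        text = data.strip()
        if self.in_td and text:
            self.cells.append(text)


def cells_of(html):
    """Parse html and return the collected cell texts."""
    parser = CellCollector()
    parser.feed(html)
    parser.close()
    return parser.cells


# Made-up sample row with one empty (&nbsp;) cell.
html = "<table><tr><td>7:05</td><td>&nbsp;</td><td>7:35</td></tr></table>"

print(cells_of(html))                            # the empty cell is lost
print(cells_of(html.replace("&nbsp;", "NaN")))   # the empty cell becomes "NaN"
```

Keep in mind that the replacement touches every &nbsp; in the page, not only
those inside table cells.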

Below is a solution that is only slightly more robust. It implements a
rudimentary "what table are we in?" check (cells are collected only at table
nesting depth 2) and can handle table cells with multiple data chunks.

import string, urllib
from HTMLParser import HTMLParser

class foo(HTMLParser):
    def __init__(self):
        self.matrix = []
        self.row = None
        self.cell = None
        self.in_table = 0
        self.empty = "NaN"
        self.reset()

    def handle_starttag(self,tag,attrs):
        if tag == "table":
            self.in_table += 1
        elif self.in_table == 2:
            if tag == "td":
                assert self.cell is None
                self.cell = []
            elif tag == "tr":
                self.row = []
                self.matrix.append(self.row)

    def handle_data(self,data):
        if self.in_table == 2:
            if self.cell is not None:
                data = string.strip(data)
                if data:  # skip whitespace-only chunks so empty cells stay empty
                    self.cell.append(data)

    def handle_endtag(self,tag):
        if tag == "table":
            self.in_table -= 1
        elif self.in_table == 2:
            if tag == "td":
                s = " ".join(self.cell).replace("\n", " ")
                if s == "":
                    s = self.empty
                self.row.append(s)
                self.cell = None
            elif tag == "tr":
                self.row = None

parser = foo()
if 0:
    instream = urllib.urlopen(
        "http://winnipegtransit.com/TIMETABLE/TODAY/STOPS/105413bottom.html")
else:
    instream = file("105413bottom.html")
data = instream.read()
parser.feed(data)
instream.close()
parser.close()
for row in parser.matrix:
    assert len(row) == 4
    print row

I've replaced the urlopen() call with access to a local file as you might
want to run your tests with a local copy of the time table, too.
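
For readers on Python 3, the same idea might be sketched as follows. This is an
adaptation under assumptions, not the script above: the class name TableMatrix,
the depth and empty parameters, and the sample HTML are invented for
illustration. convert_charrefs=False is passed because by default Python 3
would hand &nbsp; to handle_data() as "\xa0" and the cell would no longer look
empty. The last lines show the column access the original question asked for.

```python
from html.parser import HTMLParser


class TableMatrix(HTMLParser):
    """Collects the cells of tables at a given nesting depth into rows."""

    def __init__(self, depth=2, empty="NaN"):
        super().__init__(convert_charrefs=False)
        self.matrix = []
        self.row = None
        self.cell = None
        self.in_table = 0
        self.depth = depth   # nesting level of the table to collect
        self.empty = empty   # placeholder for cells without text

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.in_table += 1
        elif self.in_table == self.depth:
            if tag == "td":
                self.cell = []
            elif tag == "tr":
                self.row = []
                self.matrix.append(self.row)

    def handle_data(self, data):
        if self.in_table == self.depth and self.cell is not None:
            data = data.strip()
            if data:
                self.cell.append(data)

    def handle_endtag(self, tag):
        if tag == "table":
            self.in_table -= 1
        elif self.in_table == self.depth:
            if tag == "td" and self.cell is not None and self.row is not None:
                text = " ".join(self.cell).replace("\n", " ")
                self.row.append(text or self.empty)
                self.cell = None
            elif tag == "tr":
                self.row = None


# Made-up sample: the data table is nested inside a layout table.
sample = """
<table><tr><td>
  <table>
    <tr><td>6:05</td><td>&nbsp;</td><td>6:35</td></tr>
    <tr><td>7:05</td><td>7:20</td><td>&nbsp;</td></tr>
  </table>
</td></tr></table>
"""

parser = TableMatrix()
parser.feed(sample)
parser.close()

# Every row has the same length, so zip() can transpose rows into columns.
columns = [list(col) for col in zip(*parser.matrix)]
print(parser.matrix)
print(columns[1])
```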

Peter



