Extracting data from an HTML table to a list or text file

Steve Holden sholden at holdenweb.com
Sun Dec 15 14:16:25 EST 2002


"Matthew Hirsch" <meh9 at cornell.edu> wrote ...
> Hi,
>
> What would be the easiest way to extract the following data into a
> list or a text file with columns?
>
> Here is the link to the data:
>
> http://co-ops.nos.noaa.gov/tides/get_pred.shtml?stn=6945+Kings+Point&secstn=Willets+Point&thh=+0&thm=03&tlh=+0&tlm=03&hh=*1.00&hl=*1.04
>
> I tried using htmllib and formatter but the data isn't aligned
> correctly.  Thanks for your help.
>

You're just going to have to be a bit more canny in the way you parse the
data. Some rows have ten elements (days when there are only three tide
changes), most have thirteen, some are irregular (such as the daylight
saving time change), and the first row of each table is always headings.
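
To make the row layout concrete, here is a minimal sketch of the grouping
step on its own. It assumes what is described above: the first cell of a row
is the date and the remaining cells come in groups of three (time, type,
height), so a ten-cell row yields three tide changes and a thirteen-cell row
yields four. The name split_row and the handling of leftover cells are
illustrative choices, not something the page itself dictates.

```python
def split_row(cells):
    """Split one table row into (date, time, kind, height) tuples.

    Assumes the first cell is the date and the rest come in threes.
    """
    if len(cells) < 10:
        # too short to be a data row (heading or filler), so ignore it
        return []
    date, rest = cells[0], cells[1:]
    events = []
    # walk the remaining cells three at a time; a trailing partial
    # group on an irregular row is silently dropped
    for i in range(0, len(rest) - 2, 3):
        time, kind, height = rest[i:i + 3]
        events.append((date, time, kind, height))
    return events
```

Dropping a partial group is only one way to treat the irregular rows; you
may prefer to log them and inspect them by hand.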

You should also remember that this kind of ugliness is exactly what XML is
supposed to help you avoid, so the attached program is an ugly hack because
the data you're analyzing is also an ugly hack :-). That said, it *appears*
to work on the URL you gave. You'll have to test it more thoroughly yourself
on other data sets, and this will probably involve more ugly hacking. No
warranties, etc.

It should be relatively easy to modify the output format to give you what
you want. On older versions of Python (before 2.2.1, I believe) you may need
to set True = 1 and False = 0 at the top of the program. Happy holidays ...
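
As one example of changing the output format, the sketch below pads each
field to a fixed width so the result lines up in columns when written to a
text file. The function name format_columns, the tuple layout (date, time,
type, height) and the column widths are all assumptions made for
illustration; adjust them to the data you actually get back.

```python
def format_columns(events, widths=(12, 8, 4, 6)):
    """Return one fixed-width text line per (date, time, kind, height) tuple."""
    lines = []
    for event in events:
        # left-justify each field in its column, then trim trailing blanks
        cells = [str(field).ljust(width) for field, width in zip(event, widths)]
        lines.append(" ".join(cells).rstrip())
    return "\n".join(lines)
```

Writing the returned string to a file opened for writing would then give the
"text file with columns" asked for.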

========================
import htmllib, urllib, formatter, sys

def Usage():
    print """
Usage: python tidetbls.py URL
"""

class myHTMLParser(htmllib.HTMLParser):

    def __init__(self, f):
        htmllib.HTMLParser.__init__(self, f)
        self.tblcount = 0
        self.header = False
        self.doparse = False
        self.copying = False
        self.tblindent = 0

    def start_table(self, attrs):
        self.tblindent += 1
        self.tblcount += 1
        if self.tblcount > 2 and self.tblindent == 2:
            self.header = True
            self.doparse = True

    def end_table(self):
        self.tblindent -= 1
        self.doparse = False

    def start_tr(self, attrs):
        if not self.doparse:
            return
        self.data = []

    def end_tr(self):
        if not self.doparse:
            return
        if self.header:
            self.header = False # stop ignoring rows
        else:
            print self.results(self.data)

    def start_td(self, attrs):
        if not self.doparse:
            return
        self.copying = True
        self.text = ""

    def end_td(self):
        if not self.doparse:
            return
        self.data.append(self.text)
        self.copying = False

    def handle_data(self, txt):
        if self.copying:
            self.text += txt

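    def results(self, data):
        if len(data) < 10: # assume this is extraneous
            return ""
        outlst = []
        date = data.pop(0) # first cell is the date
        # remaining cells come in (time, type, height) groups; stop at a
        # partial group so an irregular row cannot raise ValueError
        while len(data) >= 3:
            time, type, height = data[0:3]
            data = data[3:]
            outlst.append("%s %s %s %s" % (date, time, type, height))
        return ", ".join(outlst)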

def parse(url, formatter):
    f = urllib.urlopen(url)
    data = f.read()
    f.close()
    p = myHTMLParser(formatter)
    p.feed(data)
    p.close()

if len(sys.argv) != 2:
    Usage()
else:
    fmt  = formatter.NullFormatter()
    parse(sys.argv[1], fmt)

========================
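
The program above depends on htmllib, which is not available in Python 3. As
a rough sketch only, the cell-collecting part of the same idea can be written
with html.parser from the standard library. The class name TableRows is made
up for this example, and unlike the program above it collects the rows of
every table and does no table counting or nesting checks, so the filtering
of the tables you want would still have to be added.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the cell text of every <tr> as a list of strings."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self.row = None     # cells of the row being read, or None
        self.text = None    # text of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row = []
        elif tag in ("td", "th") and self.row is not None:
            self.text = ""

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self.text is not None:
            self.row.append(self.text.strip())
            self.text = None
        elif tag == "tr" and self.row is not None:
            self.rows.append(self.row)
            self.row = None
            self.text = None   # discard a cell left open by sloppy markup

    def handle_data(self, data):
        if self.text is not None:
            self.text += data
```

After feeding the page text to an instance with feed() and calling close(),
the rows attribute holds one list per table row, which can then be filtered
by length in the same way the results method does.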

regards
-----------------------------------------------------------------------
Steve Holden                                  http://www.holdenweb.com/
Python Web Programming                 http://pydish.holdenweb.com/pwp/
Previous .sig file retired to                    www.homeforoldsigs.com
-----------------------------------------------------------------------





