Extracting data from an HTML table to a list or text file
Steve Holden
sholden at holdenweb.com
Sun Dec 15 14:16:25 EST 2002
"Matthew Hirsch" <meh9 at cornell.edu> wrote ...
> Hi,
>
> What would be the easiest way to extract the following data into a
> list or a text file with columns?
>
> Here is the link to the data:
>
> http://co-ops.nos.noaa.gov/tides/get_pred.shtml?stn=6945+Kings+Point&secstn=Willets+Point&thh=+0&thm=03&tlh=+0&tlm=03&hh=*1.00&hl=*1.04
>
> I tried using htmllib and formatter but the data isn't aligned
> correctly. Thanks for your help.
>
You're just going to have to be a bit more canny in the way you parse the
data. Some rows have ten elements (days when there are only three tide
changes), most have thirteen, some are irregular (such as the daylight
saving change) and the first row of each table is always headings.
You should also remember that this kind of ugliness is exactly what XML is
supposed to help you avoid, so the attached program is an ugly hack because
the data you're analyzing is also an ugly hack :-). That said, it *appears*
to work on the URL you gave. You'll have to test it more thoroughly yourself
on other data sets, and this will probably involve more ugly hacking. No
warranties, etc.
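To make the row shapes concrete, here is a minimal sketch of splitting one row's cells into tide events. It assumes the first cell is the date and the remaining cells come in (time, type, height) groups of three, so a ten-element row gives three events and a thirteen-element row gives four. The name `split_row` and the sample values are made up for illustration, and unlike the script in this post it drops leftover cells from an irregular row rather than failing on them.

```python
def split_row(cells):
    """Split one table row into (date, time, kind, height) tuples.

    Assumes the first cell is the date and the rest come in groups of
    three; leftover cells from an irregular row are dropped.
    """
    if len(cells) < 4:  # not enough for a date plus one full group
        return []
    date, rest = cells[0], cells[1:]
    events = []
    # Step through the remaining cells three at a time, full groups only.
    for i in range(0, len(rest) - 2, 3):
        time, kind, height = rest[i:i + 3]
        events.append((date, time, kind, height))
    return events


# Hypothetical ten-element row: a date followed by three tide changes.
row = ["12/15", "01:10", "H", "7.1", "07:30", "L", "0.4", "13:20", "H", "7.5"]
events = split_row(row)
```

Whether dropping the leftover cells is the right call depends on what the irregular rows really contain, so treat that choice as an assumption to check against the page.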
It should be relatively easy to modify the output format to give you what
you want. On older versions of Python (before 2.2.2, I believe) you may need
to set True = 1 and False = 0. Happy holidays ...
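As one possible output format, the sketch below writes each tide event as a line of tab-separated columns, which covers the "text file with columns" case from the question. It is written for current Python rather than the 2.2-era interpreter discussed here, and the function name `write_events` and the sample tuple are assumptions for illustration only.

```python
import csv
import sys


def write_events(events, out=sys.stdout):
    """Write (date, time, kind, height) tuples as tab-separated columns."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for event in events:
        writer.writerow(event)


# Hypothetical sample data, printed to standard output.
write_events([("12/15", "01:10", "H", "7.1")])
```

Passing an open file object as `out` sends the same columns to a text file instead.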
========================
import htmllib, urllib, formatter, sys

def Usage():
    print """
Usage: python tidetbls.py URL
"""

class myHTMLParser(htmllib.HTMLParser):

    def __init__(self, f):
        htmllib.HTMLParser.__init__(self, f)
        self.tblcount = 0
        self.header = False
        self.doparse = False
        self.copying = False
        self.tblindent = 0

    def start_table(self, attrs):
        self.tblindent += 1
        self.tblcount += 1
        if self.tblcount > 2 and self.tblindent == 2:
            self.header = True
            self.doparse = True

    def end_table(self):
        self.tblindent -= 1
        self.doparse = False

    def start_tr(self, attrs):
        if not self.doparse:
            return
        self.data = []

    def end_tr(self):
        if not self.doparse:
            return
        if self.header:
            self.header = False # stop ignoring rows
        else:
            print self.results(self.data)

    def start_td(self, attrs):
        if not self.doparse:
            return
        self.copying = True
        self.text = ""

    def end_td(self):
        if not self.doparse:
            return
        self.data.append(self.text)
        self.copying = False

    def handle_data(self, txt):
        if self.copying:
            self.text += txt

    def results(self, data):
        if len(data) < 10: # assume this is extraneous
            return ""
        outlst = []
        date = data.pop(0)
        while data:
            time, type, height = data[0:3]
            data = data[3:]
            outlst.append("%s %s %s %s" % (date, time, type, height))
        return ", ".join(outlst)

def parse(url, formatter):
    f = urllib.urlopen(url)
    data = f.read()
    f.close()
    p = myHTMLParser(formatter)
    p.feed(data)
    p.close()

if len(sys.argv) != 2:
    Usage()
else:
    fmt = formatter.NullFormatter()
    parse(sys.argv[1], fmt)
========================
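The script above depends on `htmllib`, which is not available in Python 3. For readers on a current interpreter, the following is a rough sketch of the same row-collecting idea using `html.parser` from the standard library. It is not a port of the script: it collects the `<td>` cells of every row it sees and does not reproduce the table-counting logic, and it does not try to handle tables nested inside cells. The class name `RowCollector`, the helper `extract_rows` and the sample HTML are all made up for illustration.

```python
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Collect the text of each <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []    # finished rows, each a list of cell strings
        self.row = None   # cells of the row being read, or None
        self.cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row = []
        elif tag == "td" and self.row is not None:
            self.cell = []

    def handle_endtag(self, tag):
        if tag == "td" and self.cell is not None:
            self.row.append("".join(self.cell).strip())
            self.cell = None
        elif tag == "tr" and self.row is not None:
            if self.row:  # skip rows with no <td> cells, e.g. <th> headings
                self.rows.append(self.row)
            self.row = None
            self.cell = None

    def handle_data(self, data):
        if self.cell is not None:
            self.cell.append(data)


def extract_rows(html):
    """Return a list of rows, each a list of stripped cell strings."""
    parser = RowCollector()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hypothetical sample markup, not copied from the tide prediction page.
sample = (
    "<table><tr><th>Date</th><th>Time</th></tr>"
    "<tr><td>12/15</td><td> 01:10 </td></tr></table>"
)
rows = extract_rows(sample)
```

Fetching the page (for example with `urllib.request.urlopen`) is left out so the sketch runs offline on the sample string.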
regards
-----------------------------------------------------------------------
Steve Holden http://www.holdenweb.com/
Python Web Programming http://pydish.holdenweb.com/pwp/
Previous .sig file retired to www.homeforoldsigs.com
-----------------------------------------------------------------------