Extracting data from HTML
geoff at gerrietts.net
Sat Jun 1 06:55:08 CEST 2002
Quoting Ian Bicking (ianb at colorstudy.com):
> On Fri, 2002-05-31 at 14:52, Hazel wrote:
> > how do i write a program that
> > will extract info from an HTML and print
> > of a list of TV programmes, its Time, and Duration
> > using urllib?
> You can get the page with urllib. You can use htmllib to parse it, but
> I often find that regular expressions (the re module) are an easier way
> -- since you aren't looking for specific markup, but specific
> expressions. You'll get lots of false negatives (and positives), but
> when you are parsing a page that isn't meant to be parsed (like most web
> pages) no technique is perfect.
I definitely agree with this sentiment.
I'll go a step further and do a little compare and contrast.
Once upon a time, I wanted to grab data from the
weatherunderground.com website. I know there are lots of better ways
to go about getting this information, these days, but I was not so
well-informed back then.
So I wanted to grab this information, and I tried using regular
expressions to mangle the page. But truthfully, it was just too hard
to do. I could guess about where in the file the table with all the
info would appear, but getting a regular expression that was inclusive
enough to catch all the quirks, yet exclusionary enough to filter out
all the other embedded tables, proved a very large challenge.
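A minimal sketch of that problem, using the re module on a made-up page fragment. The SAMPLE markup and the all_cells helper are illustrative assumptions, not the original code or the real page:

```python
import re

# Hypothetical page fragment: a navigation table followed by the data table.
SAMPLE = """
<table><tr><td>Home</td><td>Maps</td></tr></table>
<table>
<tr><th>Temperature</th><td>61 F</td></tr>
<tr><th>Humidity</th><td>72%</td></tr>
</table>
"""

# Naive pattern: capture the contents of every <td>, whatever table it is in.
CELL = re.compile(r"<td[^>]*>(.*?)</td>", re.IGNORECASE | re.DOTALL)

def all_cells(html):
    """Return the stripped contents of every <td> element in html."""
    return [cell.strip() for cell in CELL.findall(html)]

print(all_cells(SAMPLE))
```

The pattern cannot tell the navigation cells from the weather cells, so both come back in one flat list. Tightening it to skip the first table tends to break as soon as the page layout changes.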
That's when the idea of a parser made a lot of sense.
I could push the whole page through a parser, looking for one
particular phrase in a <TH> element, and from that point forward, map
<TH> elements to <TD> elements effectively. It became a very simple
exercise, because I knew how to find that info.
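The idea can be sketched roughly as follows. htmllib belongs to Python 2 and was removed in Python 3, so this sketch substitutes html.parser from the current standard library. The class name, the trigger phrase, and the sample markup are all assumptions made for illustration:

```python
from html.parser import HTMLParser

# Hypothetical page fragment: a navigation table followed by the data table.
SAMPLE = """
<table><tr><td>Home</td><td>Maps</td></tr></table>
<table>
<tr><th>Temperature</th><td>61 F</td></tr>
<tr><th>Humidity</th><td>72%</td></tr>
</table>
"""

class HeaderCellParser(HTMLParser):
    """Pair each <th> with the <td> that follows it, starting at a trigger."""

    def __init__(self, trigger):
        super().__init__()
        self.trigger = trigger  # <th> text that marks the start of the data
        self.active = False     # True once the trigger heading has been seen
        self.current = None     # "th" or "td" while inside such a cell
        self.text = []          # text fragments of the current cell
        self.heading = None     # last <th> text still waiting for its <td>
        self.data = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <TH> and <th> look the same.
        if tag in ("th", "td"):
            self.current = tag
            self.text = []

    def handle_data(self, data):
        if self.current:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag != self.current:
            return
        value = "".join(self.text).strip()
        self.current = None
        if tag == "th":
            if value == self.trigger:
                self.active = True
            self.heading = value if self.active else None
        elif self.active and self.heading is not None:
            self.data[self.heading] = value
            self.heading = None

def parse_table(html, trigger):
    """Return a dict mapping <th> text to <td> text from the trigger onward."""
    parser = HeaderCellParser(trigger)
    parser.feed(html)
    parser.close()
    return parser.data

print(parse_table(SAMPLE, "Temperature"))
```

Cells that appear before the trigger heading are ignored, which is what filters out the other embedded tables.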
But as Ian rightly points out, htmllib or any other real parser can be very
heavy if you're just looking to grab unformatted info -- or if you
can't count on the page's formatting staying consistent.
Both techniques are worth knowing -- but better than either would be
finding a way to get the information you're after via XML-RPC or some
other protocol that's designed to carry data rather than rendering.
Best of luck,
Geoff Gerrietts <geoff at gerrietts net>
http://www.gerrietts.net/
"If life were measured by accomplishments,
most of us would die in infancy." --A.P. Gouthey