Extracting data from HTML

Geoff Gerrietts geoff at gerrietts.net
Sat Jun 1 00:55:08 EDT 2002


Quoting Ian Bicking (ianb at colorstudy.com):
> On Fri, 2002-05-31 at 14:52, Hazel wrote:
> > how do i write a program that
> > will extract info from an HTML and print
> > of a list of TV programmes, its Time, and Duration
> > using urllib?
> 
> You can get the page with urllib.  You can use htmllib to parse it, but
> I often find that regular expressions (the re module) are an easier way
> -- since you aren't looking for specific markup, but specific
> expressions.  You'll get lots of false negatives (and positives), but
> when you are parsing a page that isn't meant to be parsed (like most web
> pages) no technique is perfect.

Definitely agree with this sentiment.

I'll go a step further, and do a little compare/contrast.

Once upon a time, I wanted to grab data from the
weatherunderground.com website. I know there are plenty of better ways
to get this information these days, but I was not so well-informed
back then.

My first attempt used regular expressions to pick the page apart. But
truthfully, it was just too hard to do. I could guess roughly where in
the file the table with all the info would appear, but writing a
regular expression that was inclusive enough to catch all the quirks,
yet exclusive enough to filter out all the other embedded tables,
proved a very large challenge.
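
For illustration only, here is a minimal sketch of the regular
expression approach. The page fragment, the ROW pattern and the
extract_pairs name are all made up for this example and are not the
original code. The pattern copes with this tidy sample, but a nested
table inside a cell, or a missing </td> (which browsers tolerate), is
enough to defeat it.

```python
import re

# Hypothetical page fragment; a real page is far messier than this.
PAGE = """
<table><tr><td>Home</td><td>News</td></tr></table>
<TABLE>
<TR><TH>Temperature</TH><TD>72 F</TD></TR>
<tr><th>Humidity</th>
    <td align="right">40%</td></tr>
</TABLE>
"""

# Match a <th> cell immediately followed by a <td> cell, tolerating
# upper/lower case tags, attributes, and whitespace between the cells.
ROW = re.compile(
    r"<th[^>]*>\s*(.*?)\s*</th>\s*<td[^>]*>\s*(.*?)\s*</td>",
    re.IGNORECASE | re.DOTALL,
)


def extract_pairs(html):
    """Return a list of (header, value) tuples found in the markup."""
    return ROW.findall(html)


pairs = extract_pairs(PAGE)
for header, value in pairs:
    print(header, "=", value)
```

Every quirk the page throws up means another clause in the pattern,
which is how the expression grows out of control.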

That's when the idea of a parser made a lot of sense.

I could push the whole page through a parser, watch for one
particular phrase in a <TH> element, and from that point forward pair
each <TH> element with the <TD> element that followed it. It became a
very simple exercise, because the parser gave me a dependable landmark
for finding that info.
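
A rough sketch of that idea follows. It is not the original code: it
assumes the html.parser module from a current Python standard library
in place of htmllib, and the sample page, the "Current Conditions"
marker phrase, and the HeaderCellParser and extract_after_marker names
are invented for the example.

```python
from html.parser import HTMLParser

# Hypothetical page: a navigation table, then the table we care about.
PAGE = """
<table><tr><th>Site</th><td>Navigation</td></tr></table>
<TABLE>
<TR><TH colspan="2">Current Conditions</TH></TR>
<TR><TH>Temperature</TH><TD><b>72</b> F</TD></TR>
<tr><th>Humidity</th><td align="right">40%</td></tr>
</TABLE>
"""


class HeaderCellParser(HTMLParser):
    """Collect <th>/<td> pairs once a marker phrase is seen in a <th>."""

    def __init__(self, marker):
        super().__init__()
        self.marker = marker
        self.started = False        # becomes True at the marker <th>
        self.current = None         # "th" or "td" while inside a cell
        self.buffer = []            # text fragments of the current cell
        self.pending_header = None  # last <th> text awaiting its <td>
        self.data = {}

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased, so <TH> and <th> look the same.
        if tag in ("th", "td"):
            self.current = tag
            self.buffer = []

    def handle_data(self, text):
        if self.current is not None:
            self.buffer.append(text)

    def handle_endtag(self, tag):
        if tag != self.current:
            return  # inline markup such as </b> inside a cell
        text = " ".join("".join(self.buffer).split())
        self.current = None
        if tag == "th":
            if self.marker in text:
                self.started = True
            self.pending_header = text if self.started else None
        elif self.pending_header is not None:
            self.data[self.pending_header] = text
            self.pending_header = None


def extract_after_marker(html, marker):
    parser = HeaderCellParser(marker)
    parser.feed(html)
    return parser.data


conditions = extract_after_marker(PAGE, "Current Conditions")
print(conditions)
```

The navigation table is skipped because nothing is recorded until the
marker phrase turns up, and inline markup inside a cell does not
disturb the pairing.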

But as Ian rightly points out, htmllib and a real parser can be very
heavy if you're just looking to grab unformatted info -- or if you
can't count on the formatting being consistent.

Both techniques are worth knowing -- but better than either would be
finding a way to get the information you're after via XML-RPC or some
other protocol that's designed to carry data rather than rendering
instructions.
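
To show what "designed to carry data" buys you, here is a small sketch
using the xmlrpc.client module from a current Python standard library
(the module was called xmlrpclib in older releases). No real service is
contacted: the listings are invented sample data, and the code only
encodes them as an XML-RPC response and decodes them again, the way a
client library would on receipt.

```python
import xmlrpc.client

# Invented sample data standing in for what a listings service might
# return; the field names are assumptions for this example.
listings = [
    {"title": "Evening News", "time": "18:00", "duration_minutes": 30},
    {"title": "Late Film", "time": "22:15", "duration_minutes": 105},
]

# Encode the data as an XML-RPC method response (what a server sends).
payload = xmlrpc.client.dumps((listings,), methodresponse=True)

# Decode it again (what the client library does with the response).
(decoded,), method_name = xmlrpc.client.loads(payload)

for programme in decoded:
    print(programme["time"], programme["title"],
          programme["duration_minutes"], "min")
```

The values come back as ordinary lists, dictionaries, strings and
integers, with no markup to pick apart.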

Best of luck,
--G.

-- 
Geoff Gerrietts             "If life were measured by accomplishments, 
<geoff at gerrietts net>     most of us would die in infancy." 
http://www.gerrietts.net/       --A.P. Gouthey




