Extracting data from HTML

Ian Bicking ianb at colorstudy.com
Fri May 31 18:25:52 EDT 2002


On Fri, 2002-05-31 at 14:52, Hazel wrote:
> how do i write a program that
> will extract info from an HTML page and print
> out a list of TV programmes, their Time, and Duration
> using urllib?

You can get the page with urllib.  You can use htmllib to parse it, but
I often find that regular expressions (the re module) are easier,
since you aren't looking for specific markup but for specific text
patterns.  You'll get some false negatives (and false positives), but
when you are parsing a page that wasn't meant to be parsed (like most
web pages), no technique is perfect.
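
Here is a rough sketch of the regular-expression approach, not a
finished solution.  I don't know what your listings page looks like, so
the table layout in SAMPLE, the ROW pattern, and the function names
fetch and extract_programmes are all made up for illustration; you would
have to adapt the pattern to the real markup.

```python
import re
# urlopen lives in urllib.request on Python 3 (plain urllib in older versions).
from urllib.request import urlopen

# Assumed row layout: <td>18:00</td><td>Title</td><td>30 min</td>
ROW = re.compile(
    r"<td[^>]*>\s*(\d{1,2}:\d{2})\s*</td>\s*"  # start time, e.g. 18:00
    r"<td[^>]*>\s*([^<]+?)\s*</td>\s*"         # title (plain text only)
    r"<td[^>]*>\s*(\d+)\s*min\s*</td>",        # duration in minutes
    re.IGNORECASE,
)


def fetch(url):
    """Download a page and return it as text (the encoding is a guess)."""
    with urlopen(url) as response:
        return response.read().decode("latin-1")


def extract_programmes(html):
    """Return a list of (time, title, minutes) tuples found in html."""
    return [(time, title, int(minutes))
            for time, title, minutes in ROW.findall(html)]


# Hypothetical page fragment, used here instead of a real download.
SAMPLE = """
<table>
<tr><td>18:00</td><td>Evening News</td><td>30 min</td></tr>
<tr><td class="t">18:30</td><td> Weather </td><td>5 min</td></tr>
<tr><td>18:35</td><td><b>Film</b></td><td>90 min</td></tr>
</table>
"""

if __name__ == "__main__":
    for time, title, minutes in extract_programmes(SAMPLE):
        print("%s  %-15s %3d min" % (time, title, minutes))
```

For a real page you would call extract_programmes(fetch(url)) with the
address of the listings.  Note that the third row in SAMPLE is skipped
because its title is wrapped in a <b> tag, which the pattern doesn't
allow for; that is exactly the kind of false negative I mean.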

  Ian






