Would anyone show me how to use htmllib?

Alex Martelli aleaxit at yahoo.com
Tue Oct 31 05:24:31 EST 2000


<jackxh at my-deja.com> wrote in message news:8tln3f$sf$1 at nnrp1.deja.com...
> Thank you for the example.
> I went back and took a look at htmllib again. Some parts make more sense
> now. Here is what I wanted to do:
>
> I noticed that there are lots of patterns in HTML pages, and I want to
> extract information from HTML pages (based on patterns). I have done this
> using Perl's regular expressions before. Now I am wondering if I can speed
> up the development process and have a standard approach for this problem
> using Python's htmllib.

Absolutely yes.  Particularly because HTML syntax is NOT parsable
by regular expressions (either Perl's or Python's -- they're quite
close); you can get, say, 80% of the way there with an amount X
of effort, then each halving of the remaining percentage of "cases
not well treated" doubles the overall effort.  It's a no-win
strategy.
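
To illustrate the point about regular expressions, here is a minimal
sketch comparing a naive pattern with a real parser on three ordinary
ways of writing the same kind of link. Everything in it is an assumption
made for illustration: the names NAIVE_HREF, HrefCollector, SAMPLE,
regex_hrefs and parser_hrefs are hypothetical, and because htmllib was
removed in Python 3 the sketch uses html.parser.HTMLParser from the
current standard library as a stand-in.

```python
import re
from html.parser import HTMLParser

# A typical "first attempt" pattern (hypothetical): it assumes lowercase
# tags, href as the first attribute, and double-quoted values.
NAIVE_HREF = re.compile(r'<a href="([^"]*)">')


class HrefCollector(HTMLParser):
    """Collect href values from <a> tags, however they are written."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names and strips quotes.
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.hrefs.append(value)


# Three legal spellings of a link: plain, uppercase with single quotes,
# and unquoted with the attributes reordered across a line break.
SAMPLE = '''
<a href="one.html">one</a>
<A HREF='two.html' class="x">two</A>
<a class="y"
   href=three.html>three</a>
'''


def regex_hrefs(text):
    return NAIVE_HREF.findall(text)


def parser_hrefs(text):
    parser = HrefCollector()
    parser.feed(text)
    parser.close()
    return parser.hrefs
```

On SAMPLE the pattern finds only the first link, while the parser finds
all three. Each missed case can be patched into the pattern, but every
patch covers a smaller share of real pages than the one before it.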


> For reference, the htmllib library documentation mentions:
> ######################################################################
> #This module defines a class which can serve as a base for parsing text
> #files formatted in the HyperText Mark-up Language (HTML).
> ######################################################################
>
> All of the examples I have seen extract URL links from an HTML
> page. I was wondering if I can do more with this module.

You have to inherit from HTMLParser, and override some methods, if
you want to do more than extracting links (or simple output formatting),
because that is what HTMLParser itself does.  Sometimes it's handier
to use sgmllib rather than htmllib, actually -- sgmllib is "more
primitive" (htmllib's parser inherits from sgmllib's), but that IS
handy at times.
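
As a rough sketch of the inherit-and-override approach, the following
subclass pulls the text out of table cells, grouped by row. It is
written under stated assumptions, not as the htmllib API: htmllib and
sgmllib were removed in Python 3, so it subclasses
html.parser.HTMLParser instead, whose handler names
(handle_starttag, handle_endtag, handle_data) differ from the
per-tag start_xxx/end_xxx methods that sgmllib dispatches to. The
names CellExtractor and extract_rows are hypothetical.

```python
from html.parser import HTMLParser


class CellExtractor(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text is only kept while a cell is open.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None  # drop any cell left unclosed in this row


def extract_rows(html_text):
    """Parse html_text and return a list of rows of cell text."""
    parser = CellExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.rows
```

Calling extract_rows() on a page's text returns a list of lists, one
inner list per table row. The same shape (a little state on the parser
object, updated from the tag and data handlers) extends to most
"extract by pattern" tasks.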

For an example of htmllib use, see, e.g., my post:
http://www.deja.com/getdoc.xp?AN=661888820
"converting an html table to a tree", and its thread.


Alex





