Would anyone show me how to use htmllib?

jackxh at my-deja.com jackxh at my-deja.com
Thu Nov 2 03:53:08 EST 2000


I went through your link. It seems to me that you can only process
the HTML tags by defining start_"TAG NAME" functions. This feature
is limited. A lot of the time, the meaningful stuff is in the content
of the HTML.
I don't know whether my thinking is right or not?
Jack Xie
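
On the point about content: a parser of this kind reports the text
between tags as well as the tags themselves, so a handler can note which
element it is inside and collect the text that follows. A minimal sketch
of that idea, under some assumptions: htmllib and sgmllib no longer exist
in Python 3, so this uses the standard library's html.parser.HTMLParser,
whose handle_starttag, handle_endtag and handle_data methods stand in for
the start_TAG, end_TAG and handle_data hooks. The HeadingText class, the
extract_headings helper and the choice of <h2> are made up for
illustration.

```python
from html.parser import HTMLParser


class HeadingText(HTMLParser):
    """Collect the text found inside <h2> elements (hypothetical example)."""

    def __init__(self):
        super().__init__()
        self.in_heading = False   # True while we are between <h2> and </h2>
        self.headings = []        # one string per <h2> element seen

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lower-cased.
        if tag == "h2":
            self.in_heading = True
            self.headings.append("")

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_heading = False

    def handle_data(self, data):
        # Called for the text between tags; keep it only inside <h2>.
        if self.in_heading:
            self.headings[-1] += data


def extract_headings(html):
    """Return the stripped text of every <h2> element in the given HTML."""
    parser = HeadingText()
    parser.feed(html)
    parser.close()
    return [text.strip() for text in parser.headings]
```

Text inside nested inline tags such as <b> is kept too, because the flag
stays set until the closing </h2> arrives.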

In article <8tm6uo01m29 at news1.newsguy.com>,
  "Alex Martelli" <aleaxit at yahoo.com> wrote:
> <jackxh at my-deja.com> wrote in message
> news:8tln3f$sf$1 at nnrp1.deja.com...
> > Thank you for the example.
> > I went back and took a look at htmllib again. Some parts make more
> > sense now. Here is what I wanted to do:
> >
> > I noticed that there are lots of patterns in HTML pages, and I want
> > to extract information out of HTML pages (based on patterns). I have
> > done this using Perl's regular expressions before. Now I am wondering
> > if I can speed up the development process and have a standard
> > approach for this problem using Python's htmllib.
>
> Absolutely yes.  Particularly because HTML syntax is NOT parsable
> by regular-expressions (either Perl's or Python's -- they're quite
> close); you can get, say, 80% of the way there with an amount X
> of effort, then each halving of the remaining percentage of "cases
> not well treated" doubles the overall effort.  It's a no-win
> strategy.
>
> > For reference, the htmllib library documentation mentions:
> >
> > ######################################################################
> > #This module defines a class which can serve as a base for parsing
> > #text files formatted in the HyperText Mark-up Language (HTML).
> > ######################################################################
> >
> > All of the examples I have seen extract URL links from an HTML
> > page. I was wondering if I can do more with this module.
>
> You have to inherit from HTMLParser, and override some methods, if
> you want to do more than extracting links (or simple output
> formatting), because that is what HTMLParser itself does.  Sometimes
> it's handier to use sgmllib rather than htmllib, actually -- sgmllib
> is "more primitive" (htmllib's parser inherits from sgmllib's), but
> that IS handy at times.
>
> For an example of htmllib use, see, e.g., my post:
> http://www.deja.com/getdoc.xp?AN=661888820
> "converting an html table to a tree", and its thread.
>
> Alex
>
>
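
To make the regular-expression point above concrete, here is a small
sketch comparing a naive pattern with a parser subclass on a nested
table. It is only an illustration under assumptions: it uses Python 3's
html.parser.HTMLParser (htmllib and sgmllib are not available there),
and SAMPLE, regex_cells, CellCollector and collect_cells are made-up
names, not anything from the post linked above.

```python
import re
from html.parser import HTMLParser

# Hypothetical input: a table whose first cell contains another table.
SAMPLE = (
    "<table><tr><td><table><tr><td>inner</td></tr></table></td>"
    "<td>outer</td></tr></table>"
)


def regex_cells(html):
    """Naive approach: grab whatever sits between <td> and the next </td>."""
    return re.findall(r"<td>(.*?)</td>", html, re.S)


class CellCollector(HTMLParser):
    """Record each cell's own text with the nesting depth of its table."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # how many <table> elements are currently open
        self.open = []    # stack of [depth, text] for cells still open
        self.cells = []   # finished cells as (depth, text) tuples

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.depth += 1
        elif tag == "td":
            self.open.append([self.depth, ""])

    def handle_endtag(self, tag):
        if tag == "table":
            self.depth -= 1
        elif tag == "td" and self.open:
            depth, text = self.open.pop()
            self.cells.append((depth, text.strip()))

    def handle_data(self, data):
        # Text belongs to the innermost cell that is still open.
        if self.open:
            self.open[-1][1] += data


def collect_cells(html):
    """Return (table depth, text) for every <td>, in closing order."""
    parser = CellCollector()
    parser.feed(html)
    parser.close()
    return parser.cells
```

On SAMPLE the pattern pairs the outer <td> with the inner </td>, so its
first "cell" is a fragment of markup, while the subclass keeps track of
which table each cell belongs to.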

