Looking for a decent HTML parser for Python...

hubritic colinlandrum at gmail.com
Wed Dec 6 11:50:27 EST 2006


Agreed that the web sites are probably broken.  Try running the HTML
through HTML Tidy (http://tidy.sourceforge.net/) first. Doing that has
let me parse pages where I had problems such as yours.

I have also had luck with BeautifulSoup, which includes its own tidy
function.



Just Another Victim of the Ambient Morality wrote:
> "Just Another Victim of the Ambient Morality" <ihatespam at hotmail.com> wrote
> in message news:Gordh.303466$tl2.18227 at fe10.news.easynews.com...
> >
> >    Okay, I think I found what I'm looking for in HTMLParser in the
> > HTMLParser module.
>
>     Except it appears to be buggy or, at least, not very robust.  There are
> websites for which it falsely terminates early in the parsing.  I have a
> sneaking feeling the sgml parser will be more robust, if only it had that
> one feature I am looking for.
>     Can someone help me out here?
>     Thank you...



