Looking for a decent HTML parser for Python...
hubritic
colinlandrum at gmail.com
Wed Dec 6 11:50:27 EST 2006
Agreed that the web sites are probably broken. Try running the HTML
through HTMLTidy (http://tidy.sourceforge.net/). Doing that has allowed
me to parse pages where I had problems such as yours.
I have also had luck with BeautifulSoup, which includes a tidy
function of its own.
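For anyone reading this on a current Python 3, here is a minimal sketch of
a tolerant parse using only the standard library's html.parser module. It
assumes the goal is pulling links out of a page, which the original post
does not actually say; the names LinkCollector and collect_links and the
sample markup are made up for illustration. It is not a replacement for
HTMLTidy or BeautifulSoup, just a quick way to check whether broken markup
stops the parser.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags, whether or not tags are closed."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value can be None.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def collect_links(markup):
    """Feed markup to a fresh parser and return the hrefs it saw."""
    parser = LinkCollector()
    parser.feed(markup)
    parser.close()
    return parser.links


# Hypothetical broken page: the <p>, <b>, and <a> tags are never closed.
broken = (
    '<html><body><p>One <a href="a.html">first'
    '<p>Two <b><a href="b.html">second</body>'
)
print(collect_links(broken))
```

Because the collector only reacts to start tags, missing end tags do not
affect which links it finds. Pages that are broken in other ways may
still need a cleanup pass first.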
Just Another Victim of the Ambient Morality wrote:
> "Just Another Victim of the Ambient Morality" <ihatespam at hotmail.com> wrote
> in message news:Gordh.303466$tl2.18227 at fe10.news.easynews.com...
> >
> > Okay, I think I found what I'm looking for in HTMLParser in the
> > HTMLParser module.
>
> Except it appears to be buggy or, at least, not very robust. There are
> websites for which it falsely terminates early in the parsing. I have a
> sneaking feeling the sgml parser will be more robust, if only it had that
> one feature I am looking for.
> Can someone help me out here?
> Thank you...