HTMLParser not parsing whole html file

Stefan Behnel stefan_ml at behnel.de
Mon Oct 25 08:44:42 CEST 2010


josh logan, 25.10.2010 04:14:
> I found the error. The HTML file I'm parsing has invalid HTML at line
> 193. It has something like:
>
> <a href="mystuff "class = "stuff">
>
> Note there is no space between the closing quote of the "href" attribute
> and the class attribute. I guess I'll go through each file and correct
> these issues as I parse them.

HTMLParser is not designed to cope with input that is not valid HTML. Take a
look at lxml.html or BeautifulSoup (up to 3.0), both of which handle broken
markup like this much better.
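
As a rough illustration, here is a minimal sketch of reading the kind of
markup quoted above with lxml.html. It assumes the third-party lxml package
is installed; the sample string and the extract_links helper are made up for
this example and are not taken from the original file.

```python
# Requires the third-party lxml package (e.g. "pip install lxml").
import lxml.html

# Hypothetical sample modelled on the quoted snippet: there is no space
# between the closing quote of href and the class attribute.
BROKEN = '<p><a href="mystuff "class = "stuff">link text</a></p>'


def extract_links(markup):
    """Return a list of (href, class) pairs, one per <a> element."""
    # lxml.html uses a recovering parser, so malformed attributes
    # do not abort the parse.
    root = lxml.html.fromstring(markup)
    return [(a.get("href"), a.get("class")) for a in root.iter("a")]


if __name__ == "__main__":
    for href, css_class in extract_links(BROKEN):
        print(repr(href), repr(css_class))
```

For whole files, lxml.html.parse(filename) returns a tree whose getroot()
can be walked in the same way, so the input files should not need to be
corrected by hand first.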

Stefan



