HTMLParser not parsing whole html file
nagle at animats.com
Tue Oct 26 18:39:44 CEST 2010
On 10/24/2010 11:44 PM, Stefan Behnel wrote:
> josh logan, 25.10.2010 04:14:
>> I found the error. The HTML file I'm parsing has invalid HTML at line
>> 193. It has something like:
>> <a href="mystuff "class = "stuff">
>> Note there is no space between the closing quote for the "href" attribute
>> and the class attribute. I guess I'll go through each file and correct
>> these issues as I parse them.
> HTMLParser is not made to deal with non-HTML input. You can take a look
> at lxml.html or BeautifulSoup (up to 3.0), which handle these problems a
> lot better.
You might try HTML5lib:
The HTML 5 spec formalizes the concept of "bad HTML". Really. There's
a specified way to parse the most common HTML errors. Most browsers
are far more tolerant of bad HTML than they should be, and not in a
consistent way. The HTML 5 spec tries to fix that.
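For the tag quoted above, the HTML 5 tokenizer rules treat the missing
space after the closing quote as a recoverable parse error and simply
start a new attribute. Here is a minimal sketch of that recovery,
assuming a current Python 3, whose standard-library html.parser is far
more lenient than the HTMLParser module of 2010. It does not use
html5lib. The AttrCollector class and the BAD_HTML sample string are
made up for illustration.

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Hypothetical helper: records each start tag with its attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.tags.append((tag, dict(attrs)))


# The malformed tag from the quoted message: no space before "class".
BAD_HTML = '<p><a href="mystuff "class = "stuff">link</a></p>'

collector = AttrCollector()
collector.feed(BAD_HTML)
collector.close()

for tag, attrs in collector.tags:
    print(tag, attrs)
```

If the parser recovers as described, the "a" tag should be reported
with both an href and a class attribute rather than stopping the parse.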
I use BeautifulSoup, but it's being abandoned for the Python 3