[issue14538] HTMLParser: parsing error

Thu Apr 12 17:26:45 CEST 2012

R. David Murray <rdmurray at bitdance.com> added the comment:

Yes, after considerable discussion those of us working on this stuff decided that the goal should be for the parser to be able to complete parsing, without error, anything the typical browsers can parse (which means pretty much anything, though that says nothing about whether the result of the parse is useful in any way).  In other words, we've been treating it as a bug when the parser throws an error, since one generally uses the library to parse web pages from the internet, and having the parse fail leaves you SOL for doing anything useful with the bad pages one gets from there.  (Note that if the parser were strictly adhering to the older RFCs our decision would have been different...but it is not.  It has always accepted *some* badly formed documents, and rejected others.)
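
To make the goal concrete, here is a minimal sketch of what "complete parsing without error" looks like from the caller's side. It assumes a recent Python 3 in which html.parser runs in its tolerant mode; the TagCollector class, the collect_start_tags helper, and the sample markup are all made up for illustration and are not taken from the issue.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Hypothetical helper: records every start tag the parser reports."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)


def collect_start_tags(markup):
    """Feed markup to the parser and return the start tags it saw."""
    parser = TagCollector()
    parser.feed(markup)
    parser.close()  # flush whatever is still buffered
    return parser.start_tags


# Deliberately bad markup (an assumption for the example): an unclosed <p>,
# an unquoted attribute value, and a stray </span> with no matching open tag.
broken = "<html><body><p>unclosed paragraph<div class=oops>text</span></body>"
tags = collect_start_tags(broken)
print(tags)
```

The point is only that feed() and close() return normally on input like this, so the caller still gets something to work with; whether the resulting tag sequence is useful is a separate question.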

Also note that BeautifulSoup in Python 2 used the sgmllib parser, which didn't throw errors, but that module is gone in Python 3.  In Python 3 BeautifulSoup uses the html parser...which is what started us down this road to begin with.

----------
nosy: +r.david.murray

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue14538>
_______________________________________