Hi, over the last few months/years I have worked on HTMLParser and the html package. This is a brief summary of what happened, what I'm working on for 3.4, and what the plans are for the future. I'm looking for feedback on the latter.
When I first looked at it, the parser was able to parse most valid pages and had some heuristics to work around common mistakes, but it raised exceptions on most broken pages. In Python 3.2 it also gained a new strict argument that, when set to False, allowed the parser to try to handle some more broken markup.
Last year HTML5 became a "Candidate Recommendation", and a lengthy specification with detailed algorithms to handle both valid and invalid markup was released [0]. Since then I have worked on bringing HTMLParser in line with the HTML5 standard, while trying to remain backward compatible and, where necessary, providing warnings for the changes I was making. Since the HTML5 standard specifies how to handle broken markup, and since the strict mode of HTMLParser is not strict enough to be used to validate markup, I decided to deprecate it in 3.3 and remove it in 3.5 [2]. Currently the parser is able to handle horribly broken markup and (in theory) should never raise errors while parsing HTML. The result it produces is really close to what the standard says and what browsers do (I intentionally ignore a few obscure corner cases to keep the code relatively simple/fast). This is true for both 2.7 and 3.x (you can try to break it and report any failures you might encounter). Python 3.3 also comes with the list of HTML5 entities (html.entities.html5), and 3.4 will have an html.unescape() function to convert them to the corresponding Unicode characters.
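A minimal sketch of the behavior described above, assuming Python 3.4 or later (where html.unescape() is available). The TagCollector class and the sample markup are hypothetical and only serve as an illustration: the parser is fed deliberately broken markup and is expected to finish without raising, and the entity table and html.unescape() are used to convert character references.

```python
import html
from html.entities import html5
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Hypothetical helper that records the start tags the parser reports."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


# Deliberately broken markup: unquoted attribute, unclosed <b>/<i>,
# and a truncated end tag at the very end.
broken = "<div class=foo><p>unclosed <b>bold<i>text</p></div"

collector = TagCollector()
collector.feed(broken)   # should not raise on broken input
collector.close()
print(collector.tags)

# The HTML5 entity table maps names (with trailing ';') to characters.
quote_char = html5["quot;"]
print(repr(quote_char))

# html.unescape() converts named and numeric references in one pass.
text = html.unescape("&quot;caf&eacute;&#34; &gt; tea")
print(text)
```

The exact events produced for the malformed tail of the input may differ between Python versions, so the sketch only relies on the parser not raising and on the well-formed start tags being reported.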
Now I'm working on #13633 (Automatically convert character references in HTMLParser [1]), and I'm planning to add a convert_charrefs boolean flag to the constructor that, when set to True, will automatically convert charrefs (e.g. "&quot;", "&#34;") to the corresponding Unicode characters and avoid calling the handle_charref/handle_entityref methods. Since in my opinion this behavior is preferable, I am thinking about switching the default to True in 3.5/3.6 and adding a warning to 3.4 that tells the user to either set convert_charrefs explicitly or be ready for a behavior change in 3.5/3.6. This means that users of HTMLParser will see the warning in 3.4 and will have to set the flag explicitly; they will be able to remove it again in 3.5/3.6, when the default will be True and the warning will be gone.
Do you think this would be acceptable? If not, can you think of any better way to do it?
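To make the difference between the two modes concrete, here is a small sketch. It assumes the flag is exposed as the convert_charrefs keyword argument of HTMLParser, as it is in Python 3.4 and later; the EventLogger class and the parse() helper are hypothetical names used only for this example.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical subclass that records text and charref callbacks."""

    def __init__(self, *, convert_charrefs):
        # Pass the flag explicitly so the behavior does not depend on
        # whichever default the running Python version uses.
        super().__init__(convert_charrefs=convert_charrefs)
        self.events = []

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_entityref(self, name):
        self.events.append(("entityref", name))

    def handle_charref(self, name):
        self.events.append(("charref", name))


def parse(markup, *, convert_charrefs):
    """Feed markup to a fresh EventLogger and return the recorded events."""
    parser = EventLogger(convert_charrefs=convert_charrefs)
    parser.feed(markup)
    parser.close()
    return parser.events


markup = "<p>&quot;hi&#34;</p>"

# Flag off: each reference is reported through its own callback.
events_off = parse(markup, convert_charrefs=False)
print(events_off)  # [('entityref', 'quot'), ('data', 'hi'), ('charref', '34')]

# Flag on: references are converted and merged into the surrounding text.
events_on = parse(markup, convert_charrefs=True)
print(events_on)   # [('data', '"hi"')]
```

Setting the flag explicitly, as the sketch does, is what the proposed 3.4 warning would ask users to do until the default changes.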
After this, most of the work on HTMLParser and the html package should be done. I plan to update the documentation to say that the parser is (almost) compliant with HTML5, phase out the deprecated "strict" argument [2] and eventually the warning about convert_charrefs, and possibly do some optimizations/cleanups. There's also an open issue to add a generator-based API [3], but that's a major change and I need more time to think about it.
Best Regards,
Ezio Melotti
[1]: http://bugs.python.org/issue13633 - Automatically convert character references in HTMLParser
[2]: http://bugs.python.org/issue15114 - Deprecate strict mode of HTMLParser
[3]: http://bugs.python.org/issue17410 - Generator-based HTMLParser