The evolution of HTMLParser
Hi, during the last few months/years I worked on HTMLParser and the html package -- this is a brief summary of what happened, what I'm working on for 3.4, and what the plans for the future are. I'm looking for some feedback about the latter.
When I first looked at it, the parser was able to parse most valid pages, and had some heuristics to work around common mistakes, but was raising exceptions on most broken pages. In Python 3.2 it also gained a new strict argument that, when set to False, would allow the parser to try to handle some more broken markup.
Last year HTML5 became a "Candidate Recommendation", and a lengthy specification with detailed algorithms to handle both valid and invalid markup was released [0]. Since then I worked on converging HTMLParser to the HTML5 standard, while trying to remain backward compatible and, where necessary, provide warnings for the changes I was making. Since the HTML5 standard specifies how to handle broken markup and since the strict mode of HTMLParser is not strict enough to be used to validate markup, I decided to deprecate it in 3.3 and remove it in 3.5 [2]. Currently the parser is able to handle horribly broken markup and (in theory) should never raise errors while parsing HTML. The result it produces is really close to what the standard says and what the browser does (I intentionally ignore a few obscure corner cases to keep the code relatively simple/fast). This is true for both 2.7 and 3.x (you can try to break it and report any failures you might encounter). Python 3.3 also comes with the list of HTML 5 entities (html.entities.html5), and 3.4 will have an html.unescape() function to convert them to the corresponding Unicode characters.
Now I'm working on #13633 (Automatically convert character references in HTMLParser [1]), and I'm planning to add a convert_charrefs boolean flag to the constructors that, when set to True, will automatically convert charrefs (e.g. "&quot;", "&#34;") to the corresponding Unicode characters, and avoid calling the handle_charref/handle_entityref methods. Since in my opinion this behavior is preferable, I am thinking about switching the default to True in 3.5/3.6 and adding a warning to 3.4 that tells the user to either set convert_charrefs explicitly or be ready for a behavior change in 3.5/3.6. This means that HTMLParser users will see the warning and will have to set the flag, and they will be able to remove it in 3.5/3.6, when the default will be True and the warning will be gone.
Do you think this would be acceptable? If not, can you think of any better way to do it?
After this, most of the work on HTMLParser and the html package should be done. I plan to update the documentation and say that the parser is (almost) compliant with HTML 5, phase out the deprecated "strict" argument [2] and eventually the warning about convert_charrefs, and possibly do some optimizations/clean ups. There's also an open issue to add a generator-based API [3], but that's a major change and I need more time to think about it.
Best Regards, Ezio Melotti
[1]: http://bugs.python.org/issue13633 - Automatically convert character references in HTMLParser
[2]: http://bugs.python.org/issue15114 - Deprecate strict mode of HTMLParser
[3]: http://bugs.python.org/issue17410 - Generator-based HTMLParser
On Wed, 2013-11-20 at 21:57 +0200, Ezio Melotti wrote:
Now I'm working on #13633 (Automatically convert character references in HTMLParser [1]), and I'm planning to add a convert_charrefs boolean flag to the constructors that, when set to True, will automatically convert charrefs (e.g. "&quot;", "&#34;") to the corresponding Unicode characters, and avoid calling the handle_charref/handle_entityref methods.
How about a separate StandardHTMLParser class that would have the right handle_charref / handle_entityref implementations? (you could also change other behaviours in that class if desired)
Regards
Antoine.
On Wed, Nov 20, 2013 at 10:34 PM, Antoine Pitrou <antoine@python.org> wrote:
On Wed, 2013-11-20 at 21:57 +0200, Ezio Melotti wrote:
Now I'm working on #13633 (Automatically convert character references in HTMLParser [1]), and I'm planning to add a convert_charrefs boolean flag to the constructors that, when set to True, will automatically convert charrefs (e.g. "&quot;", "&#34;") to the corresponding Unicode characters, and avoid calling the handle_charref/handle_entityref methods.
How about a separate StandardHTMLParser class that would have the right handle_charref / handle_entityref implementations? (you could also change other behaviours in that class if desired)
When convert_charrefs is True, handle_charref/handle_entityref are not called at all. This is in part because there's no easy way to tell where an invalid charref ends (e.g. if the ';' is missing), so the parser would either have to find only correctly terminated charrefs (but that doesn't allow handling invalid HTML5 entities) or apply the HTML5 algorithm (or a subset of it) to find what might be a charref, and then the user would have to do it again to find the corresponding character.
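As a small illustration of the missing-';' problem, the sketch below shows how html.unescape() treats terminated, unterminated and unknown references. It assumes Python 3.4 or later (where html.unescape() is available); the input strings are made up for the example.

```python
import html

# A correctly terminated named reference.
terminated = html.unescape("foo&gt;bar")

# The ';' is missing: html.unescape() looks for the longest known
# entity name at the start of "gtbar" and finds "gt".
unterminated = html.unescape("foo&gtbar")

# A name that is not a known entity is left untouched.
unknown = html.unescape("foo&zzqx;bar")

print(terminated, unterminated, unknown)
```

A handler that only receives the raw name would have to repeat this prefix search itself, which is the duplicated work described above.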
So, for example, passing "<p>foo&gt;bar</p>" to the parser currently results in:
- a call to handle_starttag with "p";
- a call to handle_data with "foo";
- a call to handle_entityref with "gt" (for "&gt;");
- a call to handle_data with "bar";
- a call to handle_endtag with "p";
The user then has to write the code in handle_entityref to convert "gt" to ">" and then do "foo" + ">" + "bar" before getting the content of the paragraph, i.e. "foo>bar".
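The call sequence listed above can be reproduced with a small recording subclass. This is only a sketch: EventRecorder is a hypothetical helper, and convert_charrefs=False is passed explicitly on the assumption that the code runs on Python 3.4 or later, where the flag exists.

```python
from html.parser import HTMLParser


class EventRecorder(HTMLParser):
    """Hypothetical helper: records every handler call as a tuple."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("starttag", tag))

    def handle_endtag(self, tag):
        self.events.append(("endtag", tag))

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_entityref(self, name):
        # Receives the bare entity name, without '&' and ';'.
        self.events.append(("entityref", name))

    def handle_charref(self, name):
        self.events.append(("charref", name))


parser = EventRecorder(convert_charrefs=False)
parser.feed("<p>foo&gt;bar</p>")
parser.close()
events = parser.events

for event in events:
    print(event)
```

The text of the paragraph arrives in three separate pieces, so the caller has to reassemble it.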
With the proposed patch, the parser gets all the text between tags, and then passes it to html.unescape() to convert all the charrefs according to the HTML5 algorithm, so the example above becomes:
- a call to handle_starttag with "p";
- a call to handle_data with "foo>bar";
- a call to handle_endtag with "p";
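A minimal sketch of the proposed behaviour, assuming Python 3.4 or later where convert_charrefs and html.unescape() are available. DataCollector is a hypothetical name used only for this example.

```python
import html
from html.parser import HTMLParser


class DataCollector(HTMLParser):
    """Hypothetical helper: collects the text passed to handle_data."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self.ref_calls = 0

    def handle_data(self, data):
        self.chunks.append(data)

    def handle_entityref(self, name):
        # Counted only to show that it is not called in this mode.
        self.ref_calls += 1

    def handle_charref(self, name):
        self.ref_calls += 1


collector = DataCollector()
collector.feed("<p>foo&gt;bar</p>")
collector.close()
chunks = collector.chunks
ref_calls = collector.ref_calls

# The same conversion applied directly to the text between the tags.
unescaped = html.unescape("foo&gt;bar")

print(chunks, ref_calls, unescaped)
```

Here the whole document is passed in a single feed() call; when input arrives in several pieces, the text may still be delivered in more than one handle_data call.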
This also happens in the core of HTMLParser, so in order to create a subclass where charrefs are converted automatically and without the handle_charref/handle_entityref calls, I would also have to duplicate (or reorganize) a lot of code.
Also while parsing arbitrary HTML you might or might not get charrefs, so the only use cases left that I can think of are:
- preserving the entities -- this can be done by setting convert_charrefs=False and returning what gets passed to handle_charref/handle_entityref or by using html.escape() after the parsing (the output might be different though);
- using a different set of charrefs -- xml and html4 are subsets of the html5 charrefs so they are covered, for other sets it's still possible to keep using convert_charrefs=False (and people will have time till 3.5/3.6 to add it before the default changes);
(Note that unlike the strict argument/mode, I don't plan to remove convert_charrefs -- only to make it default to True.)
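The first use case above (preserving the entities) could look roughly like this. It is a sketch under assumptions: EntityPreserver and the sample input are hypothetical, and Python 3.4 or later is assumed for convert_charrefs.

```python
import html
from html.parser import HTMLParser


class EntityPreserver(HTMLParser):
    """Hypothetical helper: rebuilds the text with references intact."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        # name is e.g. "gt"; put back the '&' and ';' delimiters.
        self.parts.append("&%s;" % name)

    def handle_charref(self, name):
        # name is e.g. "34" or "x22"; put back the '&#' and ';' delimiters.
        self.parts.append("&#%s;" % name)


preserver = EntityPreserver()
preserver.feed("<p>foo&gt;bar &#34;baz&#x22;</p>")
preserver.close()
preserved = "".join(preserver.parts)

# Alternative: convert first, then escape again. The result is valid
# but not identical, because html.escape() picks its own spellings.
re_escaped = html.escape(html.unescape("foo&gt;bar &#34;baz&#x22;"))

print(preserved)
print(re_escaped)
```

The second result illustrates why the output "might be different": both numeric references come back as "&quot;".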
Best Regards, Ezio Melotti
participants (2)
- Antoine Pitrou
- Ezio Melotti