[Python-Dev] Fixing the XML batteries

Terry Reedy tjreedy at udel.edu
Sun Dec 11 00:30:34 CET 2011


On 12/10/2011 4:32 PM, Glyph Lefkowitz wrote:
> On Dec 10, 2011, at 2:38 AM, Stefan Behnel wrote:
>
>> Note, however, that html5lib is likely way too big to add it to the
>> stdlib, and that BeautifulSoup lacks a parser for non-conforming HTML
>> in Python 3, which would be the target release series for better HTML
>> support. So, whatever library or API you would want to use for HTML
>> processing is currently only the second question as long as Py3 lacks
>> a real-world HTML parser in the stdlib, as well as a robust character
>> detection mechanism. I don't think that can be fixed all that easily.
>
> Here's the problem in a nutshell, I think:
>
>  1. Everybody wants an HTML parser in the stdlib, because it's
>     inconvenient to pull in a dependency for such a "simple" task.
>  2. Everybody wants the stdlib to remain small, stable, and simple and
>     not get "overcomplicated".
>  3. Parsing arbitrary HTML5 is a monstrously complex problem, for which
>     there exist rapidly-evolving standards and libraries to deal with
>     it. Parsing 'the web' (which is rapidly growing to include stuff
>     like SVG, MathML etc) is even harder.
>
>
> My personal opinion is that HTML5Lib gets this problem almost completely
> right, and so it should be absorbed by the stdlib.

A little data: the HTML5lib project lives at
https://code.google.com/p/html5lib/
It has 4 owners and 22 other committers.

The most recent release, html5lib 0.90 for Python, is nearly 2 years
old. Since there is a separate Python3 repository, and there is no
mention of Python3 compatibility elsewhere that I saw, including the
PyPI listing, I assume that release is for Python2 only.

A comment on a recent (July 11) Python3 issue
https://code.google.com/p/html5lib/issues/detail?id=187&colspec=ID%20Type%20Status%20Priority%20Milestone%20Owner%20Summary%20Port
suggests that the Python3 version still has problems: "Merged in now,
though still lots of errors and failures in the testsuite."

-- 
Terry Jan Reedy


