[Python-Dev] Fixing the XML batteries

Terry Reedy tjreedy at udel.edu
Sun Dec 11 00:30:34 CET 2011

On 12/10/2011 4:32 PM, Glyph Lefkowitz wrote:
> On Dec 10, 2011, at 2:38 AM, Stefan Behnel wrote:
>> Note, however, that html5lib is likely way too big to add it to the
>> stdlib, and that BeautifulSoup lacks a parser for non-conforming HTML
>> in Python 3, which would be the target release series for better HTML
>> support. So, whatever library or API you would want to use for HTML
>> processing is currently only the second question as long as Py3 lacks
>> a real-world HTML parser in the stdlib, as well as a robust character
>> detection mechanism. I don't think that can be fixed all that easily.
> Here's the problem in a nutshell, I think:
>  1. Everybody wants an HTML parser in the stdlib, because it's
>     inconvenient to pull in a dependency for such a "simple" task.
>  2. Everybody wants the stdlib to remain small, stable, and simple and
>     not get "overcomplicated".
>  3. Parsing arbitrary HTML5 is a monstrously complex problem, for which
>     there exist rapidly-evolving standards and libraries to deal with
>     it. Parsing 'the web' (which is rapidly growing to include stuff
>     like SVG, MathML etc) is even harder.
> My personal opinion is that HTML5Lib gets this problem almost completely
> right, and so it should be absorbed by the stdlib.

A little data: the HTML5lib project lives at
It has 4 owners and 22 other committers.

The most recent release, html5lib 0.90 for Python, is nearly 2 years 
old. Since there is a separate Python3 repository, and there is no 
mention of Python3 compatibility elsewhere that I saw, including the 
PyPI listing, I assume that release is for Python2 only.

A comment on a recent (July 11) Python3 issue
suggests that the Python3 version still has problems: "Merged in now, 
though still lots of errors and failures in the testsuite."

Terry Jan Reedy
