[Python-Dev] Fixing the XML batteries
Terry Reedy
tjreedy at udel.edu
Sun Dec 11 00:30:34 CET 2011
On 12/10/2011 4:32 PM, Glyph Lefkowitz wrote:
> On Dec 10, 2011, at 2:38 AM, Stefan Behnel wrote:
>
>> Note, however, that html5lib is likely way too big to add it to the
>> stdlib, and that BeautifulSoup lacks a parser for non-conforming HTML
>> in Python 3, which would be the target release series for better HTML
>> support. So, whatever library or API you would want to use for HTML
>> processing is currently only the second question as long as Py3 lacks
>> a real-world HTML parser in the stdlib, as well as a robust character
>> detection mechanism. I don't think that can be fixed all that easily.
>
> Here's the problem in a nutshell, I think:
>
> 1. Everybody wants an HTML parser in the stdlib, because it's
> inconvenient to pull in a dependency for such a "simple" task.
> 2. Everybody wants the stdlib to remain small, stable, and simple and
> not get "overcomplicated".
> 3. Parsing arbitrary HTML5 is a monstrously complex problem, for which
> there exist rapidly-evolving standards and libraries to deal with
> it. Parsing 'the web' (which is rapidly growing to include stuff
> like SVG, MathML etc) is even harder.
>
>
> My personal opinion is that HTML5Lib gets this problem almost completely
> right, and so it should be absorbed by the stdlib.

A little data: the html5lib project lives at
https://code.google.com/p/html5lib/
It has 4 owners and 22 other committers.

The most recent release, html5lib 0.90 for Python, is nearly 2 years
old. Since there is a separate Python3 repository, and I saw no
mention of Python3 compatibility anywhere else, including the
PyPI listing, I assume that release is for Python2 only.
A comment on a recent (July 11) Python3 issue
https://code.google.com/p/html5lib/issues/detail?id=187&colspec=ID%20Type%20Status%20Priority%20Milestone%20Owner%20Summary%20Port
suggests that the Python3 version still has problems: "Merged in now,
though still lots of errors and failures in the testsuite."
--
Terry Jan Reedy