[Python-Dev] Fixing the XML batteries

Sat Dec 10 21:54:09 CET 2011

Stefan Behnel <stefan_ml at behnel.de> wrote:

> Bill Janssen, 09.12.2011 19:15:
> > I think another thing that might go into "refreshing the batteries" is a
> > feature comparison of BeautifulSoup and HTML5lib against the stdlib
> > competition, to see what needs to be added/revised.  Having to switch to
> > an outside package for parsing possibly invalid HTML is a pain.
> 
> Such a feature request should be worth a separate thread.
> 
> Note, however, that html5lib is likely way too big to add to the
> stdlib, and that BeautifulSoup lacks a parser for non-conforming HTML
> in Python 3, which would be the target release series for better HTML
> support. So, whatever library or API you would want to use for HTML
> processing is currently only the second question as long as Py3 lacks
> a real-world HTML parser in the stdlib, as well as a robust character
> detection mechanism. I don't think that can be fixed all that easily.

Sounds like it needs a PEP.

I'm only advocating spending some thought on what needs to be done --
whether outside libraries need to be adopted into the stdlib would be a
step after that.  But understanding *why* those libraries exist and are
widely used should be a prerequisite to "refreshing" the stdlib's support.

Bill