[Python-Dev] html5lib/BeautifulSoup (was: Integrate lxml into the stdlib? (was: Integrate BeautifulSoup into stdlib?))

Fri Mar 6 03:51:38 CET 2009

Stefan Behnel wrote:

> I would have a hard time feeling happy
> if a real-world HTML parser was added to the stdlib that provides a totally
> different interface than the best (and fastest) XML library that the stdlib
> currently has.

I doubt there would be any objection to someone contributing wrappers
for upgrades, but I wouldn't count on them being used.

lxml may well be the best choice for XML.

BeautifulSoup and html5lib wouldn't even exist if being the best XML
library actually mattered for most of *their* use cases.  Think of them more as
pre-processors, like tidylib.  If enough web pages were even valid
HTML (let alone valid and well-formed XML), no one would have bothered
to write these libraries.
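
To make the strict-versus-lenient distinction concrete, here is a minimal
sketch using only the standard library.  It is an illustration, not code
from BeautifulSoup or html5lib: the sample markup and the names
TagCollector, strict_parse_fails and lenient_tags are made up for this
example, and the module name html.parser assumes Python 3 (the same class
lived in the HTMLParser module in Python 2).

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical tag soup: an unclosed <p>, a void <br>, misnested <b>/<i>,
# and a missing </html>.
TAG_SOUP = "<html><body><p>unclosed paragraph<br><b>bold <i>mixed</b> up</i></body>"


class TagCollector(HTMLParser):
    """Records every start tag the lenient tokenizer reports."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)


def strict_parse_fails(markup):
    """Return True if a strict XML parser rejects the markup."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return True
    return False


def lenient_tags(markup):
    """Return the start tags seen by the lenient HTML tokenizer."""
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.start_tags


if __name__ == "__main__":
    print("strict XML parse fails:", strict_parse_fails(TAG_SOUP))
    print("lenient parser saw tags:", lenient_tags(TAG_SOUP))
```

Note that html.parser only tokenizes; it does not repair the markup or
build a tree.  Deciding what tree the broken input "should" have meant is
the part that BeautifulSoup and html5lib add on top.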

BeautifulSoup has the advantage of being long-proven in practice on
ugly HTML.  (You mention an lxml feature with a similar intent, but
for lxml, it is one of several addon features; for BeautifulSoup, this
is the whole point.)

html5lib does not have as long a history, but it does have the
advantage of being almost an endorsed standard.  Much of HTML 5 is
documenting the workarounds that browser makers already actually
employ to handle erroneous input, so that the complexities can at
least stop compounding.  html5lib is intended as a reference
implementation, and the W3C editor has used it to motivate changes in
the specification draft.  (This may make it unsuitable for inclusion
in the stdlib today, because of timing issues.)  In other words, it
isn't just the heuristics of one particular development team; it is
(modulo bugs, and after official publication) the heuristics that the
major web browser makers have agreed to treat as "correct" in the
future.

-jJ