[lxml-dev] Proposal: Better html5lib Support - lxml - The Python XML Toolkit

12 Jul 2008

Hi,

Lately I've been working a lot with html5lib, which has a tree builder that
can generate an lxml tree, which is awesome :-)

There are, however, a few inconveniences in html5lib's lxml support, mostly
because the html5lib API is quite complex to use. I've seen that there is
Beautiful Soup parser support in html5lib, so why not move the html5lib tree
builder into an lxml.html.html5 module (or similar) that provides the same
API as the html module (that is, `fragment_fromstring`,
`document_fromstring`, etc.)?

html5lib is currently the most advanced HTML parsing module for Python I
know about and it is able to deal with most HTML the same way popular
browsers do.

There is another small problem with html5lib and lxml interoperability: the
HTML5 doctype ("<!DOCTYPE HTML>"), which lxml naturally cannot handle.
I know that lxml is an XML library after all, but maybe support for this
special doctype could be added.

Regards,
Armin
