[lxml-dev] Proposal: Better html5lib Support
Hi,
I've lately been working a lot with html5lib, which has a tree builder that can generate an lxml tree, which is awesome :-)
There are, however, a few inconveniences in the html5lib lxml support, mostly because the html5lib API is quite complex to use. I've seen that there is Beautiful Soup parser support in html5lib, so why not move the html5lib tree builder into an lxml.html.html5 module or so that provides the same API as lxml.html (that is, `fragment_fromstring`, `document_fromstring`, etc.)?
html5lib is currently the most advanced HTML parsing module for Python I know of, and it is able to deal with most HTML the same way popular browsers do.
There is another small problem with html5lib and lxml interoperability: the HTML5 doctype ("<!DOCTYPE HTML>"), which lxml naturally cannot handle. I know that lxml is an XML library after all, but maybe support for this special doctype could be added.
Regards, Armin
Hi, Armin Ronacher wrote:
I'm lately working a lot with html5lib which has a tree builder that can generate an lxml tree which is awesome :-)
:)
There are however a few inconveniences in the html5lib lxml support. Mostly because the html5lib API is quite complex to use and I've seen that there is a beautiful soup parser support in html5lib, so why not move the html5lib tree builder into an lxml.html.html5 module or so that provides the same API as the html (that is `fragment_fromstring`, `document_fromstring`, etc.)
I do not use html5lib myself, but I'm happily taking patches if you can fix it up in a more convenient way.
There is another small problem with html5lib and lxml interoperability that is the HTML5 doctype ("<!DOCTYPE HTML>") that lxml naturally cannot handle.
Does the "cannot handle" result in any visible problems?
I know that lxml is an XML library after all, but maybe support for this special doctype could be added.
This is something that is handled at the level of libxml2 and the system wide catalogs. Check the catalogs on your system to see if there is anything that resembles that doctype. Maybe it can be added. Stefan
Stefan Behnel writes:
There are however a few inconveniences in the html5lib lxml support. Mostly because the html5lib API is quite complex to use and I've seen that there is a beautiful soup parser support in html5lib, so why not move the html5lib tree builder into an lxml.html.html5 module or so that provides the same API as the html (that is `fragment_fromstring`, `document_fromstring`, etc.)
I do not use html5lib myself, but I'm happily taking patches if you can fix it up in a more convenient way.
I'll happily create a patch :-)
There is another small problem with html5lib and lxml interoperability that is the HTML5 doctype ("<!DOCTYPE HTML>") that lxml naturally cannot handle.
Does the "cannot handle" result in any visible problems?
This document::
    <title>foo</title>
    <p>blah
Comes out as (lxml.etree.tostring)::
    <!DOCTYPE html PUBLIC "" "">
    ...
Not a big deal, as I'm not writing out data as a whole document, and if I were, then as HTML4. I think the HTML5 doctype is not a valid XML doctype, but HTML5 as a serialization format is not really XML. For HTML5 serialization one would have to use the html5lib serializer anyway, and that could add a workaround for lxml.
Regards, Armin
Hi, Armin Ronacher wrote:
Stefan Behnel writes:
There is another small problem with html5lib and lxml interoperability that is the HTML5 doctype ("<!DOCTYPE HTML>") that lxml naturally cannot handle.
Does the "cannot handle" result in any visible problems?
This document::
<title>foo</title> <p>blah
Comes out as (lxml.etree.tostring)::
<!DOCTYPE html PUBLIC "" ""> ...
We are actually serialising the DOCTYPE ourselves. Try this patch. I'm not sure if <!DOCTYPE html> is actually allowed in SGML, didn't find anything on that so far. If it isn't, I'll have to see if I can restrict the impact of the patch to this specific case. Note that you will need Cython 0.9.8 installed to build a patched lxml. Stefan
Stefan Behnel wrote:
Armin Ronacher wrote:
This document::
<title>foo</title> <p>blah
Comes out as (lxml.etree.tostring)::
<!DOCTYPE html PUBLIC "" ""> ...
I'm not sure if <!DOCTYPE html> is actually allowed in SGML, didn't find anything on that so far.
http://xml.coverpages.org//sgmlsyn/sgmlsyn.htm#P110 Looks like that's the right thing to do, so I committed a fixed version of the patch to the trunk. Stefan
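For readers on a current lxml release, here is a minimal sketch of asking for the short doctype explicitly when serialising. It assumes an lxml version in which `tostring()` accepts a `doctype` keyword; that keyword is not mentioned anywhere in this thread, and the markup is simply the sample document from above.

```python
from lxml import html

# Parse the sample document from the thread with lxml's own HTML parser.
root = html.document_fromstring("<title>foo</title><p>blah")

# Assumption: this lxml release supports the `doctype` keyword of tostring().
# The given string is written out as-is in front of the serialised tree.
serialized = html.tostring(root, doctype="<!DOCTYPE html>")

print(serialized.decode("ascii"))
```

Passing the doctype explicitly avoids depending on whatever doctype the parsed document happened to carry.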
Hi,
Stefan Behnel writes:
http://xml.coverpages.org//sgmlsyn/sgmlsyn.htm#P110
Looks like that's the right thing to do, so I committed a fixed version of the patch to the trunk.
Thanks a lot for the quick fix!
Regards, Armin
On 12 Jul 2008, at 23:14, Armin Ronacher wrote:
Comes out as (lxml.etree.tostring)::
<!DOCTYPE html PUBLIC "" ""> ...
Not a big deal, as I'm not writing out data as a whole document, and if I were, then as HTML4. I think the HTML5 doctype is not a valid XML doctype, but HTML5 as a serialization format is not really XML. For HTML5 serialization one would have to use the html5lib serializer anyway, and that could add a workaround for lxml.
The intention, FWIW, is for XHTML5 documents to have no DOCTYPE. -- Geoffrey Sneddon http://gsnedders.com/
Hi,
Stefan Behnel writes:
I do not use html5lib myself, but I'm happily taking patches if you can fix it up in a more convenient way.
I created a patch now: http://paste.pocoo.org/show/79376/
That, however, has two disadvantages. For one, it extends the lxml etree builder in a pretty ugly way, but that could probably be improved. It also creates etree.Comment objects and not lxml.html.HtmlComment objects. The same problem exists with the soupparser, mainly because there is no way to generate HtmlComment objects without causing a segfault. (The only way is to use html.fromstring with the comment in the input, but that's an ugly hack.)
Regards, Armin
Hi, Armin Ronacher wrote:
Stefan Behnel writes:
I do not use html5lib myself, but I'm happily taking patches if you can fix it up in a more convenient way.
I created a patch now: http://paste.pocoo.org/show/79376/
Thanks!
That however has two disadvantages. For one it extends the lxml etree builder in a pretty ugly way but that could probably be improved,
I'll take a look at it as soon as I find the time.
and it also creates etree.Comment objects and not lxml.html.HtmlComment objects. The same problem exists with the soupparser, mainly because there is no way to generate HtmlComment objects without creating a segfault.
Yes. Although this isn't really a bug (you should use the Comment factory to create a comment, not the _Comment or HtmlComment classes), this seems to be a common misconception especially by new users. This behaviour will change in lxml 2.2, where calling an Element class already creates a new Element.
(The only way is to use html.fromstring with the comment there, but that's an ugly hack).
Using the etree.Comment() factory is just fine and will do the right thing. Stefan
Stefan Behnel writes:
Yes. Although this isn't really a bug (you should use the Comment factory to create a comment, not the _Comment or HtmlComment classes), this seems to be a common misconception especially by new users. This behaviour will change in lxml 2.2, where calling an Element class already creates a new Element.
There is no Comment factory in lxml.html, just in lxml.etree, which I use right now. But that one creates a different object.
(The only way is to use html.fromstring with the comment there, but that's an ugly hack).
Using the etree.Comment() factory is just fine and will do the right thing.
But it returns an lxml.etree._Comment and not an lxml.html.HtmlComment, unless I'm missing something.
Regards, Armin
Hi, Armin Ronacher wrote:
Yes. Although this isn't really a bug (you should use the Comment factory to create a comment, not the _Comment or HtmlComment classes), this seems to be a common misconception especially by new users. This behaviour will change in lxml 2.2, where calling an Element class already creates a new Element.
There is no Comment factory in lxml.html, just in lxml.etree which I use right now. But that one creates a different object. [...] it returns a lxml.etree._Comment and not an lxml.html.HtmlComment or I'm missing something.
Ah, right. Interesting that no-one ever notices these things. :) Anyway, this will be fixed in 2.2, as soon as I get to implementing it. Stefan
Hi again, one more comment on Comments here. Armin Ronacher wrote:
Stefan Behnel writes:
Yes. Although this isn't really a bug (you should use the Comment factory to create a comment, not the _Comment or HtmlComment classes), this seems to be a common misconception especially by new users. This behaviour will change in lxml 2.2, where calling an Element class already creates a new Element.
There is no Comment factory in lxml.html, just in lxml.etree which I use right now. But that one creates a different object.
While that is true, the impact on the html5lib tree builder is close to zero. The comment Element won't have the same interface when it's created, but it will have it when the user asks for the comment Element when the tree is finished. This is due to the way lxml.etree assigns proxies to XML nodes. Stefan
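The point about proxies can be tried out with a small sketch. It uses only `etree.Comment()` and `lxml.html.fromstring()`; the comment text and the sample markup are made up for illustration, and the class name printed for the re-fetched node depends on the lxml version and on the old proxy actually having been released, so no particular output is claimed here.

```python
from lxml import etree, html

root = html.fromstring("<div><p>text</p></div>")

# Create the comment through the factory function, not through a class.
comment = etree.Comment("built by a tree builder")
created_type = type(comment).__name__
root.append(comment)

# Drop our own reference so that lxml may build a fresh proxy object
# the next time the node is looked up through the tree.
del comment

fetched = root[-1]
print(created_type, "->", type(fetched).__name__)
print(html.tostring(root).decode("ascii"))
```

Whatever class the second lookup reports, the node itself is an ordinary comment in the tree and serialises as such.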
Armin Ronacher wrote:
I created a patch now: http://paste.pocoo.org/show/79376/
One quick comment: You are restricting the parser input to unicode strings in Py3. Besides that being wrong in general, can html5lib actually handle unicode string input? Stefan
Hi,
Stefan Behnel writes:
One quick comment: You are restricting the parser input to unicode strings in Py3. Besides that being wrong in general, can html5lib actually handle unicode string input?
The Python3 support is currently limited to str objects (aka unicode in Python
Regards, Armin
Hi, Armin Ronacher wrote:
I created a patch now: http://paste.pocoo.org/show/79376/
Ok, I think the interface in the patch is ok. I'll take a deeper look at it later on to see if there's anything minor to improve in the code (especially once I've fixed the Comment stuff), but I've committed it for now.
I renamed the module to html5parser, though, to make it match the soupparser, with which it shares at least the common intention of parsing stuff using a different parser library (and it doesn't do anything more than parsing anyway). For a moment I wondered why you separated out the _html5builder module, but it makes sense given that it's really just the glue module (and it also has an ugly API due to subclassing the html5lib TreeBuilder).
It would be nice if you could improve the documentation with a couple of doctests, though, and provide some unittests if doctests aren't enough. I would like to make sure it works (and keeps working) as expected.
Thanks for the patch,
Stefan
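As a rough illustration of the resulting interface, the sketch below parses the sample document from earlier in the thread through the html5lib-backed module. It assumes that html5lib is installed next to lxml and that the module is importable as `lxml.html.html5parser` in your version; the local-name comparison is a precaution in case the elements come back in the XHTML namespace.

```python
# Assumption: html5lib is installed; without it this import fails.
from lxml import etree
from lxml.html import html5parser

root = html5parser.document_fromstring("<title>foo</title><p>blah")

# Compare local names only, since the elements may carry a namespace.
paragraphs = [
    el.text
    for el in root.iter()
    if isinstance(el.tag, str) and etree.QName(el).localname == "p"
]

print(etree.QName(root).localname, paragraphs)
```

The function name mirrors `lxml.html.document_fromstring`, which is the API parity asked for at the start of the thread.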
participants (3)
- Armin Ronacher
- Geoffrey Sneddon
- Stefan Behnel