[lxml-dev] XHTML handling in lxml.html

1 Mar 2008


Ian Bicking wrote:
...
> translating HTML to XHTML is kind of an outstanding issue for lxml.html,
> and it seems reasonable to me that XHTML could be parsed into the same
> classes as HTML.  The only real caveat there is that XHTML uses different
> (namespaced) tag names.

I agree that there is more we could do. For example, we could add "xhtml" as a
serialisation method that internally adds a namespace declaration to the
serialised "<html>" (iff no namespace is declared already). I'm not sure
whether it should be an error if the tree contains non-HTML elements; I guess
we could just leave that to the user.
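
To make the idea concrete, here is a rough sketch of what such an "xhtml" serialisation step might do. It is not lxml code: it uses the standard library's xml.etree.ElementTree as a stand-in, and the name serialise_as_xhtml is made up for illustration. Only elements without a namespace are moved into the XHTML namespace; anything already namespaced (including non-HTML elements) is left alone.

```python
import copy
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def serialise_as_xhtml(root):
    """Hypothetical helper: serialise a tree with the XHTML namespace declared.

    Works on a copy, so the caller's tree keeps its plain HTML tag names.
    """
    tree = copy.deepcopy(root)
    for el in tree.iter():
        # Comments and processing instructions have non-string tags; skip them.
        if isinstance(el.tag, str) and not el.tag.startswith("{"):
            el.tag = "{%s}%s" % (XHTML_NS, el.tag)
    # Ask ElementTree to emit XHTML as the default namespace (no prefix).
    # Note that this registration is global to the ElementTree module.
    ET.register_namespace("", XHTML_NS)
    return ET.tostring(tree, encoding="unicode")


plain = ET.fromstring("<html><body><p class='x'>Hi</p></body></html>")
xhtml = serialise_as_xhtml(plain)
print(xhtml)
```

Feeding the result back in should be a no-op, which corresponds to the "iff there isn't a namespace declared already" condition.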
...
> If you remove the tag names, then the classes and
> the lookup applies just fine.  (Presumably the lookup could be changed to
> support XHTML fairly easily.)

I would say so, yes. There would also be issues with the XPath expressions in
things like html.clean, I assume. It would definitely be a good thing if the
whole machinery could handle namespace-free HTML and namespaced XHTML equally
well.
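
As a minimal illustration of handling both forms equally, the sketch below matches elements by local name, ignoring any namespace. Again this uses the standard library's xml.etree.ElementTree rather than lxml, and local_name and iter_by_local_name are hypothetical helper names, not existing API. In XPath terms this is roughly what a local-name() test does.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a leading '{namespace}' from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def iter_by_local_name(root, name):
    """Yield elements called `name`, whether namespaced or not."""
    for el in root.iter():
        # Comments and processing instructions have non-string tags; skip them.
        if isinstance(el.tag, str) and local_name(el.tag) == name:
            yield el


html_doc = ET.fromstring("<html><body><a href='/x'>x</a></body></html>")
xhtml_doc = ET.fromstring(
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    '<body><a href="/x">x</a></body></html>'
)

# The same lookup works on namespace-free HTML and namespaced XHTML.
html_links = [a.get("href") for a in iter_by_local_name(html_doc, "a")]
xhtml_links = [a.get("href") for a in iter_by_local_name(xhtml_doc, "a")]
```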

Stefan

Stefan Behnel