[lxml-dev] XHTML handling in lxml.html
Ian Bicking wrote:
translating HTML to XHTML is kind of an outstanding issue for lxml.html, and it seems reasonable to me that XHTML could be parsed into the same classes as HTML. The only real caveat there is that XHTML uses different (namespaced) tag names.
I agree that there is more we could do. For example, we could add "xhtml" as a serialisation method and do stuff internally to add a namespace declaration to the serialised "<html>" (iff there isn't a namespace declared already). I'm not sure if it would be an error if the tree contains non-HTML elements, I guess we could just leave that to the user.
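One way to picture the "xhtml" serialisation idea is the rough sketch below. It is not lxml's API: it uses the standard library's xml.etree.ElementTree, which shares lxml's {namespace}tag naming convention, and the helper name serialize_xhtml is made up for illustration. It qualifies the elements only when the root carries no namespace yet, and leaves elements from other namespaces alone.

```python
import copy
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def serialize_xhtml(root):
    """Hypothetical helper: serialise a tree as XHTML text.

    The XHTML namespace is added only if the root has no namespace yet.
    """
    root = copy.deepcopy(root)  # leave the caller's tree untouched
    if not root.tag.startswith("{"):
        # No namespace on the root: qualify every un-namespaced element.
        for el in root.iter():
            # Comments and PIs have a non-string tag, so skip them.
            if isinstance(el.tag, str) and not el.tag.startswith("{"):
                el.tag = "{%s}%s" % (XHTML_NS, el.tag)
    # Map the XHTML namespace to the empty prefix so that it is emitted
    # as a default namespace. Note that this changes global state.
    ET.register_namespace("", XHTML_NS)
    return ET.tostring(root, encoding="unicode")
```

Whether non-HTML elements in the tree should count as an error is left to the caller here, in line with the suggestion above.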
If you remove the namespace from the tag names, then the classes and the lookup apply just fine. (Presumably the lookup could be changed to support XHTML fairly easily.)
I would say so, yes. There would also be issues with the XPath expressions in things like html.clean, I assume. It would definitely be a good thing if the whole machinery could handle namespace-free HTML and namespaced XHTML equally well.

Stefan
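To illustrate what handling both forms "equally well" could mean for element queries, here is a minimal sketch. It assumes the standard library's xml.etree.ElementTree rather than lxml's XPath engine, and the helper name find_html is hypothetical. It matches a tag name whether it is namespace-free or in the XHTML namespace, and ignores same-named elements from other namespaces.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def find_html(root, name):
    """Hypothetical helper: find elements called `name` in document order,
    accepting both the plain HTML and the namespaced XHTML spelling."""
    wanted = {name, "{%s}%s" % (XHTML_NS, name)}
    return [el for el in root.iter() if el.tag in wanted]
```

Code written against such a helper would not need to know whether the tree came from the HTML parser or from an XHTML document.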
Stefan Behnel wrote:
Ian Bicking wrote:
translating HTML to XHTML is kind of an outstanding issue for lxml.html, and it seems reasonable to me that XHTML could be parsed into the same classes as HTML. The only real caveat there is that XHTML uses different (namespaced) tag names.
I agree that there is more we could do. For example, we could add "xhtml" as a serialisation method and do stuff internally to add a namespace declaration to the serialised "<html>" (iff there isn't a namespace declared already). I'm not sure if it would be an error if the tree contains non-HTML elements, I guess we could just leave that to the user.
I think one of the justifications for XHTML (what few there are ;) is that it can represent non-HTML elements reasonably elegantly. But I don't think this is a problem.
If you remove the namespace from the tag names, then the classes and the lookup apply just fine. (Presumably the lookup could be changed to support XHTML fairly easily.)
I would say so, yes. There would also be issues with the XPath expressions in things like html.clean, I assume. It would definitely be a good thing if the whole machinery could handle namespace-free HTML and namespaced XHTML equally well.
This came up with Deliverance as well, as some people want to use XHTML. Because of all the namespace/URI/prefix confusion, it seems quite awkward. The most elegant solution, at least in that context, seems to be using just HTML internally. So if we get XHTML, we parse it as XML and remove the namespace from every element in the namespace http://www.w3.org/1999/xhtml. Then when serializing to XHTML, we add that namespace to everything that doesn't have a namespace (maybe restricted to a whitelist of XHTML elements). Then internally there's a consistent representation, and the XHTML/HTML division can be treated more like a parsing/serialization issue.

Arguably the distinction is more than just serialization, and {http://www.w3.org/1999/xhtml}div is really distinct from a plain div. But that's not an argument I'd make ;)

Mostly as an aside, I'm planning to parse XHTML using the XML parser, but to fall back to the HTML parser if that fails, as the parsing error behavior of the two parsers is so different that they aren't really equivalent. Or... put another way, if you consider the error-tolerant HTML parser to be suitable for a task, then the error-intolerant XML parser may not be suitable (by itself).

Ian
participants (2)
- Ian Bicking
- Stefan Behnel