Ian Bicking wrote:
> Translating HTML to XHTML is kind of an outstanding issue for lxml.html, and it seems reasonable to me that XHTML could be parsed into the same classes as HTML. The only real caveat there is that XHTML uses different (namespaced) tag names.
I agree that there is more we could do. For example, we could add "xhtml" as a serialisation method that internally adds a namespace declaration to the serialised "<html>" element (iff no namespace is declared already). I'm not sure whether it should be an error if the tree contains non-HTML elements; I guess we could just leave that to the user.
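To make the idea a bit more concrete, here is a minimal sketch of such a conversion. It uses the standard library's xml.etree.ElementTree rather than lxml so that it runs anywhere, and the function names to_xhtml() and serialize_xhtml() are made up for illustration; they are not an existing lxml API. In the ElementTree model the namespace declaration falls out of the namespaced tag names, so the sketch rewrites the tags and leaves elements that already carry a namespace alone.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def to_xhtml(root):
    """Move all namespace-free elements into the XHTML namespace (in place).

    Elements that already have a namespace (e.g. embedded SVG) are left
    untouched, i.e. what to do with non-HTML content is left to the user.
    """
    for el in root.iter():
        # Comments and processing instructions have a non-string tag.
        if isinstance(el.tag, str) and not el.tag.startswith("{"):
            el.tag = "{%s}%s" % (XHTML_NS, el.tag)
    return root


def serialize_xhtml(root):
    """Serialise with XHTML as the default (prefix-less) namespace."""
    # Note: register_namespace() changes module-wide state in ElementTree.
    ET.register_namespace("", XHTML_NS)
    return ET.tostring(root, encoding="unicode")


# Hypothetical usage: a tiny namespace-free document.
doc = ET.fromstring("<html><body><p>Hello</p></body></html>")
print(serialize_xhtml(to_xhtml(doc)))
```

A real "xhtml" serialisation method would presumably do this at output time instead of modifying the tree, but the namespace handling would be the same.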
> If you remove the tag names, then the classes and the lookup apply just fine. (Presumably the lookup could be changed to support XHTML fairly easily.)
I would say so, yes. There would also be issues with the XPath expressions in things like html.clean, I assume. It would definitely be a good thing if the whole machinery could handle namespace-free HTML and namespaced XHTML equally well.

Stefan
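For illustration only, a lookup that treats both forms alike could look roughly like the sketch below. The class names, the ELEMENT_CLASSES table and the helper functions are hypothetical stand-ins, not lxml's actual lookup code, and the sketch uses the standard library's ElementTree plus a plain iteration in place of XPath.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


# Hypothetical stand-ins for the element classes of an HTML library.
class HtmlElement:
    pass


class FormElement(HtmlElement):
    pass


class InputElement(HtmlElement):
    pass


ELEMENT_CLASSES = {"form": FormElement, "input": InputElement}


def split_tag(tag):
    """Split '{namespace}local' into (namespace, local); namespace may be None."""
    if tag.startswith("{"):
        namespace, _, local = tag[1:].partition("}")
        return namespace, local
    return None, tag


def lookup_class(tag):
    """Pick an element class by local name for plain HTML and XHTML tags alike."""
    namespace, local = split_tag(tag)
    if namespace not in (None, XHTML_NS):
        return None  # not an HTML element, e.g. SVG or MathML
    return ELEMENT_CLASSES.get(local, HtmlElement)


def iter_html(root, local_name):
    """Yield matching elements whether or not the tree is namespaced."""
    for el in root.iter():
        if not isinstance(el.tag, str):
            continue  # skip comments and processing instructions
        namespace, local = split_tag(el.tag)
        if namespace in (None, XHTML_NS) and local == local_name:
            yield el
```

With something like this, lookup_class("form") and lookup_class("{http://www.w3.org/1999/xhtml}form") return the same class, and the search code no longer needs to know which flavour of tree it was given.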