[lxml-dev] lxml.html, now with ignored namespaces!

I am using lxml to parse HTML documents, which include a custom namespace (for example, "<p cs:content='fruit'>FRUIT</p>"). In lxml 2.2.0, on Windows, this worked just fine, and elements could be processed based on this data. In lxml 2.2.2, on Linux, this fails. The above example becomes "<p content='fruit'>FRUIT</p>" as soon as it is parsed by lxml.html (or lxml.etree.HTMLParser()).

I don't know if this is caused by the switch to Linux, or the upgrade to 2.2.2. I don't have control over the installation, so I can't switch to 2.2.2 under Windows, or 2.2.0 under Linux to check.

I did find this reference (the only reference to this I could find) to the HTML ignoring namespaces: http://codespeak.net/lxml/lxmlhtml.html#running-html-doctests ...however, it wasn't doing that before, and it seems odd that this is only mentioned in the doctests section.

Is there a way to work around this? Are custom namespaces simply not possible in lxml's HTML?

Notes:

1. The XML parser will not work. Some documents will have legal HTML that breaks an XML parser, like "<br>".
2. Here is the sample code:

-----

The output:

-----
<html xmlns="http://www.w3.org/TR/1999/REC-html-in-xml" cs="http://something.com/cs" xml:lang="en" lang="en"><head><title>Help!</title></head><body><p>My namespaces are going to disappear!</p><p content="fruit">FRUIT</p></body></html>
-----

Thomas Weigel

Hi, Thomas Weigel wrote:
You forgot to mention which versions of libxml2 you are using on both systems. That's likely the reason for the difference. http://codespeak.net/lxml/FAQ.html#i-think-i-have-found-a-bug-in-lxml-what-s... Stefan

Hello, Stefan Behnel wrote:
Thank you for being kind.
http://codespeak.net/lxml/FAQ.html#i-think-i-have-found-a-bug-in-lxml-what-s...
I have begun investigating down this path. I will not bother you again until I have finished there. In the meantime, I am working around the problem with a regular expression to replace 'custom_namespace:' with 'custom_namespace_', depending on whether or not lxml deletes the custom namespace. Thank you for your time. Thomas Weigel
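
For readers who want to try the regular-expression workaround described above, here is a minimal sketch of one way it might look. The function name `flatten_prefix` and the exact pattern are assumptions made for illustration; they are not the code Thomas actually used. The pattern only rewrites `cs:` when it looks like the start of an attribute name (preceded by whitespace and followed by `=`), but it works on raw text, so it could also touch similar-looking strings in the page content.

```python
import re

PREFIX = "cs"  # the custom prefix from the example in this thread

# Matches whitespace, then "cs:", then an attribute name and its "=".
# This is a naive text-level pattern, not a real HTML tokenizer.
_ATTR_RE = re.compile(r"(\s)%s:([A-Za-z_][\w.-]*\s*=)" % re.escape(PREFIX))


def flatten_prefix(markup):
    """Rewrite 'cs:name=' to 'cs_name=' so the HTML parser keeps the prefix."""
    return _ATTR_RE.sub(r"\g<1>%s_\g<2>" % PREFIX, markup)


print(flatten_prefix("<p cs:content='fruit'>FRUIT</p>"))
```

After this rewrite the attribute can be looked up under the plain name `cs_content`, whether or not the underlying parser strips prefixes.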

Hi, I actually didn't read up to your example, sorry. Thomas Weigel wrote:
That's an XHTML document, for which the XML parser would be the right tool. If you have XHTML documents that contain unterminated <br> tags, they are not well-formed, and thus simply not XML, i.e. not XHTML. But you could try creating a custom XMLParser with the "recover" option, which will try to keep parsing despite errors. There's no guarantee that it won't kick out some data that it failed to parse, though, as usual when parsing broken documents. Obviously, the best way to deal with this kind of problem is fixing the input documents.
That's because HTML parsers are not namespace aware. Namespaces are simply not defined for HTML. But if you get a difference on different systems, I'd still suspect the reason to be different libxml2 versions. There's nothing lxml can do about this. Stefan
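
The "recover" suggestion above can be sketched roughly as follows. The input string is a hypothetical stand-in for the original sample document, which is not preserved in this thread, and the helper names `parse_leniently` and `find_by_cs_content` are made up for illustration. As noted above, a recovering parser gives no guarantee about the shape of the resulting tree, so the sketch only looks up the namespaced attribute and makes no claim about nesting.

```python
from lxml import etree

# Namespace URI as it appears in the output quoted earlier in the thread.
CS_NS = "http://something.com/cs"

# Hypothetical input: XHTML-like markup with an unterminated <br>.
BROKEN_XHTML = (
    '<html xmlns:cs="http://something.com/cs">'
    "<body><p>line one<br>line two</p>"
    '<p cs:content="fruit">FRUIT</p></body></html>'
)


def parse_leniently(markup):
    """Parse with an XML parser that keeps going after well-formedness errors."""
    parser = etree.XMLParser(recover=True)
    return etree.fromstring(markup, parser)


def find_by_cs_content(root, value):
    """Return elements whose cs:content attribute equals the given value."""
    key = "{%s}content" % CS_NS
    return [el for el in root.iter("*") if el.get(key) == value]


root = parse_leniently(BROKEN_XHTML)
matches = find_by_cs_content(root, "fruit")
print([el.text for el in matches])
```

Because this is the XML parser, the `cs` prefix has to be declared with `xmlns:cs` on an enclosing element for the attribute to end up in that namespace.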

On 27 Jun 2009, at 07:23, Stefan Behnel wrote:
It should still be outputting an attribute with a name of "cs:content"; it shouldn't be dropping the "cs:" because, as you say, there are no namespaces in HTML, so the prefix has no special meaning. My basic advice to the OP would be to use html5lib, which is far slower, but does cope with this fine. -- Geoffrey Sneddon <http://gsnedders.com/>
