Re: [lxml-dev] [Question #61584]: Is it possible to make lxml use hex instead of decimal for unicode entities?
usernamenumber wrote:
I am porting a perl/SAX tool to python/lxml. Ideally, given the same input, the new tool should produce the same output as the old tool. In fact, it introduces a number of problems for me if this is not the case.
It's always bad style to make applications depend on a specific XML serialisation done by a specific tool. That's exactly what canonical XML (C14N) was designed for.
One annoying problem I am encountering is that SAX seems to store unicode entity IDs in hex, whereas lxml uses decimal, regardless of what value is used in the input:
    import lxml.etree as etree
    example_sax_output = "<foo>Copyright &#xA9; 2009 Foocorp, Inc</foo>"  # Note: xA9
    e = etree.fromstring(example_sax_output)
    etree.tostring(e)
    <foo>Copyright &#169; 2009 Foocorp, Inc</foo>  # Note: 169
Is it possible to avoid this without doing something horribly kludgey like going through the output with a regex search and manually converting the values to hex?
There isn't a straightforward way to do that. Decimal character references were chosen for compatibility with ElementTree, which uses "xmlcharrefreplace".

However, if you have a bit of memory and do not care too much about raw performance, you can do this:

    # Python 2.6
    from itertools import imap

    unicode_xml = etree.tostring(tree, encoding=unicode)
    # keep ASCII characters as they are, write everything else as a hex character reference
    bytes_xml = b''.join(chr(c) if c < 0x80 else b'&#x%X;' % c
                         for c in imap(ord, unicode_xml))

There's also a separate serialiser API in libxml2 that happens to output hex entities. However, lxml does not use it, for backward compatibility reasons.

Stefan
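For Python 3, the same idea can be sketched with a custom codec error handler instead of the character-by-character join. This is only a sketch: the handler name "hexcharrefreplace" and the helper to_hex_charrefs() are made up for illustration and are not part of lxml or the standard library. It assumes the input is the unicode string you get from etree.tostring(tree, encoding='unicode').

```python
import codecs


def _hexcharrefreplace(exc):
    """Error handler: replace unencodable characters with &#xHEX; references."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    replacement = ''.join('&#x%X;' % ord(ch) for ch in bad)
    # Return the replacement text and the position where encoding resumes.
    return replacement, exc.end


# "hexcharrefreplace" is a hypothetical handler name chosen for this sketch.
codecs.register_error('hexcharrefreplace', _hexcharrefreplace)


def to_hex_charrefs(unicode_xml):
    """Encode serialised XML to ASCII bytes, using hex character references."""
    return unicode_xml.encode('ascii', 'hexcharrefreplace')


# With lxml, unicode_xml would come from:
#     unicode_xml = etree.tostring(tree, encoding='unicode')
unicode_xml = '<foo>Copyright \u00a9 2009 Foocorp, Inc</foo>'
print(to_hex_charrefs(unicode_xml))
```

Like the built-in "xmlcharrefreplace" handler, this only touches characters that the target encoding cannot represent, so the markup itself passes through unchanged.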
Stefan Behnel wrote:
usernamenumber wrote:
I am porting a perl/SAX tool to python/lxml. Ideally, given the same input, the new tool should produce the same output as the old tool. In fact, it introduces a number of problems for me if this is not the case.
It's always bad style to make applications depend on a specific XML serialisation done by a specific tool. That's exactly what canonical XML (C14N) was designed for.
And, as a matter of fact, C14N uses hex charrefs:

http://www.w3.org/TR/xml-c14n.html#Example-Chars

So maybe you should take a look at that.

http://codespeak.net/lxml/api.html#write-c14n-on-elementtree

Stefan
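Before relying on C14N for this, it may be worth checking what it does with the characters you actually care about. The sketch below uses the standard library's xml.etree.ElementTree.canonicalize() (Python 3.8+, C14N 2.0) as a stand-in for lxml's write_c14n(), so treat it as an approximation of what lxml would emit. As far as the example in the spec goes, hex references are kept only for a few whitespace characters such as &#xD;, while other characters are written out directly in the canonical form.

```python
from xml.etree.ElementTree import canonicalize

source = '<foo>Copyright &#xA9; 2009 Foocorp, Inc</foo>'

# canonicalize() returns the canonical form as a text string
# when no output file is given.
canonical = canonicalize(source)

# ascii() makes it visible whether the copyright sign came out as a
# character reference or as the literal character.
print(ascii(canonical))
```

If the copyright sign comes out as a literal character rather than as a reference, C14N alone will not reproduce the old tool's output and the ASCII re-encoding step is still needed.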