[lxml-dev] Unicode munging in element tag and text
Hi all, thanks for a great library. :-) I found a rather peculiar behavior in how Unicode objects are handled for element tags and text. They seem to get converted to plain strings if they contain only ASCII chars, but not always. ElementTree, by contrast, always keeps them as Unicode objects.
>>> from lxml.etree import Element as lxElem
>>> from elementtree.ElementTree import Element as etElem
1) Let's first build an element from a Unicode object with ASCII chars; only ElementTree keeps it as Unicode:
>>> lx = lxElem(u'ascii')
>>> et = etElem(u'ascii')
>>> lx.tag
'ascii'
>>> et.tag
u'ascii'
while when the Unicode object contains non-ASCII chars, both libraries correctly keep it as Unicode:
>>> lx = lxElem(u'mòrèthànàscìì')
>>> et = etElem(u'mòrèthànàscìì')
>>> lx.tag
u'm\xf2r\xe8th\xe0n\xe0sc\xec\xec'
>>> et.tag
u'm\xf2r\xe8th\xe0n\xe0sc\xec\xec'
2) The same happens for the element text; ASCII:
>>> lx.text = u'ascii'
>>> et.text = u'ascii'
>>> lx.text
'ascii'
>>> et.text
u'ascii'
non-ASCII:
>>> lx.text = u'mòrèthànàscìì'
>>> et.text = u'mòrèthànàscìì'
>>> lx.text
u'm\xf2r\xe8th\xe0n\xe0sc\xec\xec'
>>> et.text
u'm\xf2r\xe8th\xe0n\xe0sc\xec\xec'
3) OTOH, when directly setting the element tag, lxml keeps the Unicode object too:
>>> lx.tag = u'ascii'
>>> et.tag = u'ascii'
>>> lx.tag
u'ascii'
>>> et.tag
u'ascii'
while both libraries keep working correctly when using non-ASCII chars:
>>> lx.tag = u'mòrèthànàscìì'
>>> et.tag = u'mòrèthànàscìì'
>>> lx.tag
u'm\xf2r\xe8th\xe0n\xe0sc\xec\xec'
>>> et.tag
u'm\xf2r\xe8th\xe0n\xe0sc\xec\xec'
This inconsistent behavior does not seem intentional. In my opinion, in the cases 1) and 2) lxml should work as it already does in the case 3), and as ElementTree always does.

Thanks again.

-- Nicola Larosa - http://www.tekNico.net/

There is more money being spent on breast implants and Viagra today than on Alzheimer's research. This means that by 2040, there should be a large elderly population with perky boobs and huge erections and absolutely no recollection of what to do with them. -- David Icke, April 2006
Nicola Larosa wrote:
> This inconsistent behavior does not seem intentional. In my opinion, in the cases 1) and 2) lxml should work as it already does in the case 3), and as ElementTree always does.

In Python 2.x, Unicode strings are compatible with 8-bit ASCII-only strings, so the lxml.etree behaviour is perfectly acceptable. I see no reason to force an implementation that doesn't use Python objects for its internal storage to keep track of the original type (especially since the Unicode string type will disappear in Python 3.0; all strings will be able to hold Unicode data).

</F>
Hi Nicola,

Nicola Larosa wrote:
> This inconsistent behavior does not seem intentional. In my opinion, in the cases 1) and 2) lxml should work as it already does in the case 3), and as ElementTree always does.

At least under Python 2.x, lxml.etree will continue to return unicode or plain strings depending on their content. Internally, everything is stored as UTF-8, so this is for performance reasons as we can avoid unicode conversion for plain ASCII strings (which are very common, just think of numeric data, dates, etc.). This may change in Python 3.x, but then, there may be more to change, so that's not in our scope for now.

Stefan
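The content-dependent return type described above can be modelled roughly as follows. This is only a sketch under stated assumptions, not lxml's actual code: it is written for Python 3, where `bytes` stands in for the Python 2 plain string and `str` stands in for Python 2 `unicode`, and the function name `from_utf8` is hypothetical.

```python
def from_utf8(data: bytes):
    """Turn internally stored UTF-8 bytes into a Python-level value.

    Hypothetical model of the behaviour discussed in the thread:
    pure-ASCII content is handed back as-is (no decoding step),
    anything else is decoded to a Unicode string.
    """
    # Every ASCII character is a single byte below 0x80 in UTF-8.
    if all(byte < 0x80 for byte in data):
        return data                # "plain string" path: no conversion
    return data.decode("utf-8")    # "unicode" path: per-character decoding


stored_ascii = "ascii".encode("utf-8")
stored_accented = "mòrèthànàscìì".encode("utf-8")

print(type(from_utf8(stored_ascii)).__name__)     # bytes
print(type(from_utf8(stored_accented)).__name__)  # str
```

Note that Python 3 `bytes` and `str` do not compare equal, unlike the ASCII-only plain and Unicode strings of Python 2, so the sketch mirrors only the type switch and not the compatibility argument.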
Stefan Behnel wrote:
> At least under Python 2.x, lxml.etree will continue to return unicode or plain strings depending on their content. Internally, everything is stored as UTF-8, so this is for performance reasons as we can avoid unicode conversion for plain ASCII strings (which are very common, just think of numeric data, dates, etc.).
Any benchmarks supporting this decision?

-- Nicola Larosa - http://www.tekNico.net/
Fredrik Lundh wrote:
> Are you trying to use the "premature optimization is evil" argument against people who've spent more time than anyone else on optimizing Python's string subsystem? ;-)

I, for one, welcome our new Iceland Sprint overlords. ;-P

-- Nicola Larosa - http://www.tekNico.net/

Many software developers have become hostage to the development frameworks that they utilise. In turn, many frameworks have made session state a fundamental building block of web development because it permits sloppy design. -- Alan Dean, April 2006
Hi Nicola,

Nicola Larosa wrote:
> Stefan Behnel wrote:
>> At least under Python 2.x, lxml.etree will continue to return unicode or plain strings depending on their content. Internally, everything is stored as UTF-8, so this is for performance reasons as we can avoid unicode conversion for plain ASCII strings (which are very common, just think of numeric data, dates, etc.).
> Any benchmarks supporting this decision?
Pretty short question for a long answer.

Your third point was that .tag returned the original type. This is done by caching the original input, which avoids some 95% of the work required to rebuild it on each access (last time I ran the benchmark, at least). This means it is 95% faster if a program frequently accesses the same tag name. We could instead recreate a string for the result to make it match the behaviour of .text, but why bother if it's nothing more than overhead? As Fredrik said, plain strings and unicode strings are compatible, so there is no need to convert one into the other for normal string operations. It's actually your fault if you waste memory and processing time by passing a unicode string where a plain string would do.

As for the first two points: scanning a string to see whether it contains any non-ASCII characters is trivial and fast (7-bit vs. 8-bit). Creating a plain string from it means allocating the same amount of memory, copying the string (which is most likely in the processor cache already by then) using a platform-optimised memcpy (or whatever; note that we already know the length of the string by then) and then creating a Python object for it. Converting it to unicode means allocating two or four times the memory, doing a per-character conversion step by step (from multi-byte UTF-8) and then creating a Python object for it.

I didn't do much benchmarking here, but given the "95%" result above (meaning that the majority of the work is the actual string instantiation), I simply assume that avoiding the character conversion is worth it if ASCII content is frequent.

Stefan
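The caching explanation for .tag can be illustrated with a small model. This is a hedged sketch, not lxml's implementation: the class `CachedTagElement` and the helper `_from_storage` are hypothetical names, the code targets Python 3 (with `bytes` standing in for the Python 2 plain string and `str` for `unicode`), and it assumes that only assignment caches the caller's object while construction does not, which is inferred from cases 1) and 3) earlier in the thread.

```python
def _from_storage(data: bytes):
    """Rebuild a Python value from stored UTF-8 (hypothetical helper)."""
    if all(byte < 0x80 for byte in data):
        return data                # ASCII-only: plain-string stand-in
    return data.decode("utf-8")    # otherwise: Unicode string


class CachedTagElement:
    """Hypothetical element whose tag lives in non-Python storage."""

    def __init__(self, tag: str):
        self._utf8_tag = tag.encode("utf-8")  # stand-in for the C-level buffer
        self._tag_cache = None                # assumption: nothing cached yet

    @property
    def tag(self):
        if self._tag_cache is None:
            # Slow path: instantiate a new Python object from storage once.
            self._tag_cache = _from_storage(self._utf8_tag)
        return self._tag_cache                # fast path on repeated access

    @tag.setter
    def tag(self, value: str):
        self._utf8_tag = value.encode("utf-8")
        self._tag_cache = value               # keep the caller's object, type included


elem = CachedTagElement("ascii")
print(type(elem.tag).__name__)  # bytes (rebuilt from storage)

elem.tag = "ascii"
print(type(elem.tag).__name__)  # str (the cached input is handed back)
```

In this model the type seen through .tag depends on whether the value came from the cache or was rebuilt from storage, which is one way to read the difference between cases 1) and 3).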
participants (3)
- Fredrik Lundh
- Nicola Larosa
- Stefan Behnel