Question about unicode strings

Hi all,

I happen to be following the mailing lists of both lxml and rpclib. The guys at rpclib want to make a change to their code base to fix what they see as a 'quirk' of lxml. I am not qualified to comment, but I thought I would post the issue here in case anyone can suggest a cleaner solution.

With Python 3 and version 2.3.3, if you pass a unicode string to etree.fromstring(...), and then retrieve a text node from the tree, you get a unicode string back. If you pass in a byte array, you get a byte array back.

With Python 2 and version 2.2.2 (I don't have 2.3.3), if you pass a unicode string that contains a non-ASCII character, you get a unicode string back. If you pass a unicode string that contains only ASCII characters, you get a normal string back.

This behaviour is causing a problem to a user of rpclib, so the proposal is that rpclib should always convert the string to unicode before returning it. I don't know how they know that they passed in a unicode string in the first place, but I assume they have a way of checking.

The maintainer of rpclib says "If you disagree, speak now or forever hold your silence :))"

So I thought I would mention it here and see if it sounds ok.

Thanks

Frank Millman

Frank Millman, 20.02.2012 15:20:
> The guys at rpclib want to make a change to their code base to fix what they see as a 'quirk' of lxml. I am not qualified to comment, but I thought I would post the issue here in case anyone can suggest a cleaner solution.
> With Python 3 and version 2.3.3, if you pass a unicode string to etree.fromstring(...), and then retrieve a text node from the tree, you get a unicode string back. If you pass in a byte array, you get a byte array back.

No, you always get a Unicode string for names and text in Python 3, regardless of what you used for parsing (or tree building in general).

> With Python 2 and version 2.2.2 (I don't have 2.3.3), if you pass a unicode string that contains a non-ASCII character, you get a unicode string back. If you pass a unicode string that contains only ASCII characters, you get a normal string back.

Again, that's the case regardless of the original input. This behaviour was also recently discussed here: http://thread.gmane.org/gmane.comp.python.lxml.devel/6313/focus=6314

> This behaviour is causing a problem to a user of rpclib

It shouldn't normally. Apparently, the problem was that the user passed the result into unicodedata.normalize(), which rejected Py2-str as input. Sounds like a bug in the unicodedata module to me, since str is supposed to auto-decode into Unicode on Py2. I'm actually happy that Py3 finally fixed these issues...

> so the proposal is that rpclib should always convert the string to unicode before returning it. I don't know how they know that they passed in a unicode string in the first place, but I assume they have a way of checking.
> The maintainer of rpclib says "If you disagree, speak now or forever hold your silence :))"
> So I thought I would mention it here and see if it sounds ok.

It's perfectly ok, they can just wrap it in unicode() when running in Py2, or concatenate it with the empty unicode string. All that will change is the type of the object (well, and its memory consumption and the time it takes to build it, but I don't think that matters here).

Stefan
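
A minimal sketch of the workaround described above, assuming it should run unchanged on both Python 2 and Python 3. The helper name `ensure_text` is hypothetical and is not part of lxml or rpclib; the bytes literal merely stands in for an ASCII-only text node as returned by lxml on Python 2.

```python
import sys
import unicodedata

# Hypothetical helper for illustration; not part of lxml or rpclib.
if sys.version_info[0] >= 3:
    text_type = str
else:
    text_type = unicode  # noqa: F821 (this branch only runs on Python 2)


def ensure_text(value):
    """Return value as a text (unicode) string; pass None through."""
    if value is None or isinstance(value, text_type):
        return value
    # On Python 2, lxml hands back a plain str only for pure-ASCII
    # content, so decoding as ASCII is assumed to be lossless here.
    return value.decode("ascii")


# b"data" stands in for an ASCII-only node.text value on Python 2.
text = ensure_text(b"data")
normalized = unicodedata.normalize("NFC", text)
```

Only the type of the object changes; the characters stay the same, which is why the conversion is safe to apply to every returned value.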

Stefan Behnel wrote:
>> With Python 3 and version 2.3.3, if you pass a unicode string to etree.fromstring(...), and then retrieve a text node from the tree, you get a unicode string back. If you pass in a byte array, you get a byte array back.
> No, you always get a Unicode string for names and text in Python 3, regardless of what you used for parsing (or tree building in general).

You are right - sorry about that. I got confused by the following -

>>> a = etree.fromstring(b'<?xml version="1.0" encoding="UTF-8"?><root><child>data\u00e7</child></root>')
>>> a[0].text
'data\\u00e7'
>>> a = etree.fromstring('<?xml version="1.0"?><root><child>data\u00e7</child></root>')
>>> a[0].text
'dataç'

In the first case I pass in a byte array, and the text node is displayed in my terminal (Windows Command Prompt) with the unicode character escaped. In the second case I pass in a unicode string and the resulting unicode character is displayed normally. As you say, they are both unicode strings. I don't know why they display differently.

Frank

On 21/02/2012 06:52, Frank Millman wrote:
> >>> a = etree.fromstring(b'<?xml version="1.0" encoding="UTF-8"?><root><child>data\u00e7</child></root>')
> >>> a[0].text
> 'data\\u00e7'

The \uHHHH escape is not recognized in Python 3 byte string literals, so your XML source contains a literal backslash, then the letter u, etc. You can see that the backslash is doubled (escaped) in your output.

> >>> a = etree.fromstring('<?xml version="1.0"?><root><child>data\u00e7</child></root>')
> >>> a[0].text
> 'dataç'

Here the XML source is completely different (regardless of type/encoding) and contains an actual ç. If you want it escaped in a Python literal of UTF-8 bytes, try \xc3\xa7.

Regards,
--
Simon Sapin
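
The difference can be reproduced with a short sketch. It assumes Python 3 and uses the standard library's `xml.etree.ElementTree` as a stand-in for `lxml.etree`, so it runs without lxml installed; the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml.etree (assumption)

# In a bytes literal, \u is not an escape sequence, so the source holds a
# real backslash followed by "u00e7". Written here with an explicit "\\".
escaped_bytes = b"data\\u00e7"
# The same text with an actual c-cedilla, encoded as UTF-8 (two bytes).
utf8_bytes = b"data\xc3\xa7"

doc = b'<?xml version="1.0" encoding="UTF-8"?><root><child>%s</child></root>'

literal_backslash = ET.fromstring(doc % escaped_bytes)[0].text
real_cedilla = ET.fromstring(doc % utf8_bytes)[0].text

print(len(escaped_bytes), len(utf8_bytes))  # 10 6
print(repr(literal_backslash))              # 'data\\u00e7'
print(repr(real_cedilla))                   # 'dataç'
```

Both results are text strings; they differ only because the two XML sources contain different characters.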

Simon Sapin wrote:
> On 21/02/2012 06:52, Frank Millman wrote:
>> >>> a = etree.fromstring(b'<?xml version="1.0" encoding="UTF-8"?><root><child>data\u00e7</child></root>')
>> >>> a[0].text
>> 'data\\u00e7'
> The \uHHHH escaping is not allowed in Python 3 byte strings, so your XML source contains a backslash, then the letter u, etc. You see that the backslash is doubled (escaped) in your output.
>> >>> a = etree.fromstring('<?xml version="1.0"?><root><child>data\u00e7</child></root>')
>> >>> a[0].text
>> 'dataç'
> Here the XML source is completely different (regardless of type/encoding) and contains an actual ç. If you want it escaped in a Python literal of utf8 bytes, try \xc3\xa7.

Doh! Of course. Thank you, Simon.

Frank
participants (3): Frank Millman, Simon Sapin, Stefan Behnel