[lxml-dev] Unicode behaviour of Element.text

I tried to figure out the Unicode behaviour of Element.text. The lxml documentation does mention how parsing Unicode data and serializing to Unicode works, but I cannot find any information on what string type Element.text returns. From what I can see, Element.text returns either a str or a unicode instance, depending on whether the text contains non-ASCII characters. That behaviour feels inconsistent, and for applications that use Unicode throughout it means that every use of Element.text has to be written as unicode(node.text), which is not very pretty. Would it be possible to add an option to make the text attribute always return a unicode instance?

Wichert.

Wichert Akkerman, 18.03.2010 15:16:
> I tried to figure out the Unicode behaviour of Element.text. The lxml documentation does mention how parsing Unicode data and serializing to Unicode works, but I cannot find any information on what string type Element.text returns. From what I can see, Element.text returns either a str or a unicode instance, depending on whether the text contains non-ASCII characters. That behaviour feels inconsistent, and for applications that use Unicode throughout it means that every use of Element.text has to be written as unicode(node.text), which is not very pretty. Would it be possible to add an option to make the text attribute always return a unicode instance?

Since this has been asked a couple of times before, here's a short answer: that's how ElementTree works in Python 2, and lxml.etree is compatible with it. Returning a plain str is also faster for plain ASCII data (which is common). In Python 3, lxml.etree always returns Unicode strings for .tag, .text and .tail.

Stefan
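
For Python 2 code that wants a Unicode string in every case, a small wrapper may be tidier than writing unicode(node.text) everywhere. The following is only a sketch: `text_of` is a hypothetical helper, not part of lxml or ElementTree, and the fallback import assumes the standard library's ElementTree is an acceptable stand-in when lxml is not installed. It also handles empty elements, whose text is None (in Python 2, unicode(None) would give u"None").

```python
try:
    from lxml import etree  # the library discussed in this thread
except ImportError:
    # Assumption: the stdlib ElementTree API is close enough for this sketch.
    import xml.etree.ElementTree as etree


def text_of(element, default=u""):
    """Hypothetical helper: return element.text as a Unicode string."""
    value = element.text
    if value is None:
        # Empty elements have no text at all.
        return default
    if isinstance(value, bytes):
        # Python 2 only: ASCII-only text comes back as a byte string (str).
        return value.decode("ascii")
    # Already a Unicode string (always the case in Python 3).
    return value


ascii_root = etree.fromstring(b"<root>plain</root>")
accent_root = etree.fromstring(u"<root>caf\u00e9</root>".encode("utf-8"))
empty_root = etree.fromstring(b"<root/>")

results = [text_of(ascii_root), text_of(accent_root), text_of(empty_root)]
```

Under Python 3 the bytes branch is never taken for .text, so the helper reduces to replacing None with a default.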
Participants (2): Stefan Behnel, Wichert Akkerman