[lxml-dev] extracting .text strings systematically in unicode
Hello,

I am working on a small XML-to-SQL application. Input attribute values and text fields are usually unicode, but not always. They are fed into the attributes of an object that only accepts unicode input and raises an exception if the data is a 'str' instead (the object is a Storm-persisted class).

My problem seems to be that lxml extracts text either as a 'str' or as a 'unicode', depending on the content of the element, as shown in this code snippet:

    >>> from lxml.etree import XML
    >>> type( XML('<tag>element</tag>').text )
    <type 'str'>
    >>> type( XML('<tag>élément</tag>').text )
    <type 'unicode'>

So far, it seems that my only choice is to 'cast' every value extracted from the XML document to unicode, which is cumbersome and does not seem necessary. Example:

    self.name = unicode( element.get('name') )
    for child in element:
        setattr(self, child.tag, unicode( child.text ) )

Is there a switch in the lxml module to make the strings of the XML document appear predictably as unicode even if the string can be represented as a simple 'str'?

Thank you,
Hi, Jean Daniel wrote:
> Is there a switch in the lxml module to make the strings of the XML document appear predictably as unicode even if the string can be represented as a simple 'str'?
No, that's the way ElementTree works (and lxml is ET compatible). This is mainly for performance reasons, since ASCII strings are extremely common in XML. Creating a plain ASCII str is more memory efficient and a lot faster than creating a unicode object, and in Py2 it behaves the same in almost all situations (except in APIs that specifically test for unicode objects as input).

You can either switch to Py3.0, where lxml always returns unicode strings, or you can stick to casting the string yourself.

BTW, it's faster to do u""+s than to do unicode(s), although it might be considered less readable. It has the advantage of raising an exception for non-strings, though.

Stefan
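The "cast it yourself" option can be wrapped in one small function so the conversion lives in a single place. The following is only a sketch: `to_unicode` and `text_type` are hypothetical names, not part of lxml, and the sketch assumes that any byte string it receives is plain ASCII, which is the only case in which lxml returns a 'str' on Py2. Like u""+s, it rejects non-strings instead of silently stringifying them.

```python
# Hypothetical helper, not part of lxml.
text_type = type(u"")  # 'unicode' on Python 2, 'str' on Python 3


def to_unicode(value):
    """Return value as a unicode string; raise TypeError for non-strings."""
    if isinstance(value, text_type):
        return value
    if isinstance(value, bytes):
        # Assumption: byte strings are plain ASCII, as with lxml results on Py2.
        return value.decode("ascii")
    raise TypeError("expected a string, got %s" % type(value).__name__)


ascii_text = to_unicode(b"element")
accented_text = to_unicode(u"\xe9l\xe9ment")
```

Note that element.get('name') returns None for a missing attribute and .text is None for an empty element. unicode(None) would quietly produce u'None', whereas this helper raises, so the caller has to decide what to do in those cases.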
The first one is the one that raises an exception for non-strings?

John

----
> You can either switch to Py3.0 where lxml always returns unicode strings, or you can stick to casting the string yourself. BTW, it's faster to do u""+s than to do unicode(s) although it might be considered less readable. It has the advantage of raising an exception for non-strings, though.
>
> Stefan
John Lovell wrote:
> The first one is the one that raises an exception for non-strings?
    Python 2.6.1 (r261:67515, Dec 7 2008, 21:12:01)
    [GCC 4.1.3 20070929 (prerelease) (Ubuntu 4.1.2-16ubuntu2)] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> u""+1
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: coercing to Unicode: need string or buffer, int found
Stefan
Stefan Behnel wrote:
> John Lovell wrote:
>> The first one is the one that raises an exception for non-strings?
>
> Python 2.6.1 (r261:67515, Dec 7 2008, 21:12:01)
> [GCC 4.1.3 20070929 (prerelease) (Ubuntu 4.1.2-16ubuntu2)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> u""+1
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> TypeError: coercing to Unicode: need string or buffer, int found
Or to present something more lxml related (session edited for readability):

    Python 2.6.1 (r261:67515, Dec 7 2008, 21:12:01)
    [GCC 4.1.3 20070929 (prerelease) (Ubuntu 4.1.2-16ubuntu2)] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import lxml.etree as et
    >>> root = et.fromstring("<a><!--test--></a>")

    >>> root.tag
    'a'
    >>> unicode(root.tag)
    u'a'
    >>> u""+root.tag
    u'a'

    >>> root[0].tag
    <built-in function Comment>
    >>> unicode(root[0].tag)
    u'<built-in function Comment>'
    >>> u""+root[0].tag
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: coercing to Unicode: need string or buffer, \
        builtin_function_or_method found
Stefan
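The comment case matters for a loop like the one in the original question, because a comment child has a function as its .tag rather than a string. A minimal sketch of a loop that skips such children follows. `collect_fields` is a hypothetical name, and the stdlib xml.etree.ElementTree is used here as a stand-in on the assumption that lxml.etree behaves the same way for the calls shown. The comment is appended by hand because the stdlib parser drops comments by default.

```python
import xml.etree.ElementTree as ET  # stand-in; assumed interchangeable with lxml.etree here

text_type = type(u"")  # 'unicode' on Python 2, 'str' on Python 3


def collect_fields(element):
    """Map child tag -> text for real element children only (hypothetical helper)."""
    fields = {}
    for child in element:
        # Comments and processing instructions carry a factory function
        # as .tag instead of a string, so they are skipped here.
        if not isinstance(child.tag, (bytes, text_type)):
            continue
        text = child.text
        if isinstance(text, bytes):
            # Python 2 only: ASCII-only text comes back as a byte string.
            text = text.decode("ascii")
        fields[child.tag] = text
    return fields


root = ET.fromstring("<item><name>element</name><kind>test</kind></item>")
root.append(ET.Comment("test"))  # added manually; the stdlib parser discards comments
fields = collect_fields(root)
```

Checking the tag type before converting avoids ending up with an attribute named after the string form of the Comment function, which is what unicode(child.tag) would produce.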
On Tue, 2008-12-09 at 10:11 -0800, John Lovell wrote:
> The first one is the one that raises an exception for non-strings?
>
> John
Yes:

    Python 2.5.2 (r252:60911, Oct 5 2008, 19:24:49)
    [GCC 4.3.2] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> unicode(1)
    u'1'
    >>> u""+1
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: coercing to Unicode: need string or buffer, int found
participants (4)
- J. Clifford Dyer
- Jean Daniel
- John Lovell
- Stefan Behnel