iterparse and unicode
John Krukoff
jkrukoff at ltgc.com
Wed Aug 20 20:41:23 EDT 2008
On Wed, 2008-08-20 at 15:36 -0700, George Sakkis wrote:
> It seems xml.etree.cElementTree.iterparse() is not unicode aware:
>
> >>> from StringIO import StringIO
> >>> from xml.etree.cElementTree import iterparse
> >>> s = u'<name>\u03a0\u03b1\u03bd\u03b1\u03b3\u03b9\u03ce\u03c4\u03b7\u03c2</name>'
> >>> for event,elem in iterparse(StringIO(s)):
> ... print elem.text
> ...
> Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> File "<string>", line 64, in __iter__
> UnicodeEncodeError: 'ascii' codec can't encode characters in position
> 6-15: ordinal not in range(128)
>
> Am I using it incorrectly or it doesn't currently support unicode ?
>
> George
iterparse expects a file (or file-like object) that yields encoded bytes,
so handing it a StringIO wrapping a unicode string is problematic: the
data ends up being implicitly encoded as ASCII, which is the error you
see. If you want to use iterparse, the simplest fix is to encode your
string before putting it into the StringIO object, like so:
>>> for event,elem in iterparse(StringIO(s.encode('UTF8'))):
... print elem.text
...
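For reference, here is a minimal sketch of the same idea written so that
it also runs on Python 3, where cElementTree and the StringIO module no
longer exist. It assumes io.BytesIO and xml.etree.ElementTree as
stand-ins, and element_texts is a hypothetical helper name of mine, not
part of ElementTree:

```python
import io
from xml.etree.ElementTree import iterparse


def element_texts(xml_text):
    """Hypothetical helper: return the text of every element that ends."""
    # Encode the unicode string to UTF-8 bytes first, because iterparse
    # wants a byte stream rather than already-decoded text.
    source = io.BytesIO(xml_text.encode('utf-8'))
    # With no events argument, iterparse reports only "end" events.
    return [elem.text for _event, elem in iterparse(source)]


s = u'<name>\u03a0\u03b1\u03bd\u03b1\u03b3\u03b9\u03ce\u03c4\u03b7\u03c2</name>'
texts = element_texts(s)
```

The element text comes back as unicode again, so nothing is lost by the
round trip through UTF-8.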
If you encode using UTF-8, you don't need the <?xml ...?> declaration
suggested previously, since UTF-8 is the default encoding for XML.
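To illustrate when the declaration does matter, here is a rough sketch
that encodes a document as Latin-1 instead. The sample text, the
parse_texts helper and the variable names are my own assumptions, chosen
only for the demonstration; the point is that a non-UTF-8 byte stream
needs the declaration, while a UTF-8 one does not:

```python
import io
from xml.etree.ElementTree import iterparse, ParseError


def parse_texts(data):
    """Hypothetical helper: parse encoded bytes, return element texts."""
    return [elem.text for _event, elem in iterparse(io.BytesIO(data))]


body = u'<name>Jos\xe9</name>'
declaration = u'<?xml version="1.0" encoding="ISO-8859-1"?>'

# Latin-1 bytes with a matching declaration: the parser knows the encoding.
with_declaration = parse_texts((declaration + body).encode('latin-1'))

# Latin-1 bytes with no declaration: the parser assumes UTF-8 and rejects
# the lone 0xE9 byte as malformed input.
try:
    parse_texts(body.encode('latin-1'))
    undeclared_failed = False
except ParseError:
    undeclared_failed = True

# UTF-8 bytes with no declaration parse fine, since UTF-8 is the default.
utf8_result = parse_texts(body.encode('utf-8'))
```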
If you're using unicode extensively, you should consider using lxml,
which implements the same interface as ElementTree, but handles unicode
better (though it also doesn't run your example above without first
encoding the string):
http://codespeak.net/lxml/parsing.html#python-unicode-strings
You may also find the target parser interface to be more accepting of
unicode than iterparse, though it requires a different parsing interface:
http://codespeak.net/lxml/parsing.html#the-target-parser-interface
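As a rough idea of what a target parser looks like, here is a sketch
using the standard library's xml.etree.ElementTree.XMLParser, which
accepts a target object with the same start/data/end/close method names.
This is not lxml itself; I am assuming the stdlib parser on Python 3
(where feed() takes a str directly) as a stand-in, and TextCollector is
a hypothetical class written for this example:

```python
from xml.etree.ElementTree import XMLParser


class TextCollector(object):
    """Hypothetical target: gathers the character data of each element.

    Deliberately simplistic: it resets its buffer on every start tag, so
    it is only meant for flat documents like the one in this thread.
    """

    def __init__(self):
        self.texts = []
        self._buffer = []

    def start(self, tag, attrib):
        # A new element begins: start collecting its text from scratch.
        self._buffer = []

    def data(self, text):
        # Character data may arrive in several chunks.
        self._buffer.append(text)

    def end(self, tag):
        # The element is complete: join the chunks and record them.
        self.texts.append(u''.join(self._buffer))

    def close(self):
        # Whatever close() returns is handed back by parser.close().
        return self.texts


parser = XMLParser(target=TextCollector())
parser.feed(u'<name>\u03a0\u03b1\u03bd\u03b1\u03b3\u03b9\u03ce\u03c4\u03b7\u03c2</name>')
collected = parser.close()
```

The trade-off is that you get callbacks instead of element objects, so
any tree building is up to you.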
--
John Krukoff <jkrukoff at ltgc.com>
Land Title Guarantee Company