iterparse and unicode
george.sakkis at gmail.com
Thu Aug 21 04:46:04 CEST 2008
Thank you both for the suggestions. I ran a few more experiments to
understand how iterparse behaves along three dimensions:
a. Is the encoding declared in the header (if there is one)?
b. Is the text ascii-encodable (i.e. every character within range(128))?
c. Does the passed file object's read() method return str or unicode
(the latter e.g. with codecs.open(f, encoding='utf8'))?
Feel free to correct me if I misinterpreted what is really happening.
As John Krukoff mentioned, omitting the encoding declaration is
equivalent to declaring encoding="utf-8", whatever the values of the
other two dimensions, so (a) can be set aside. This leaves (b) and (c).
If a text node is ascii-encodable, iterparse() returns it as a byte
string, regardless of the declared encoding and the input file's
read() return type.
(c) becomes relevant only if a text node is not ascii-encodable. In
this case iterparse() returns unicode if the underlying file's read()
returns bytes in an encoding that matches (or at least is compatible
with) the declared encoding in the header (or the implied utf8).
If instead the file object's read() returns unicode, the characters
are implicitly encoded to ascii, which raises a UnicodeEncodeError
since the text node is not ascii-encodable.
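
For anyone who wants to repeat the experiment, here is a minimal sketch
of the kind of check described above. The helper name text_types and
the SAMPLE document are made up for illustration; the input is fed as
utf-8 bytes so it matches the declared encoding. On the Python 2
behaviour described here one would expect str for <a> and unicode for
<b>; on Python 3 every text node is simply str.

```python
import io
import xml.etree.ElementTree as ET


def text_types(xml_bytes):
    """Return (tag, text, type name) for every element that has text."""
    results = []
    # iterparse only needs an object with read(); BytesIO returns bytes.
    for _event, elem in ET.iterparse(io.BytesIO(xml_bytes)):
        if elem.text is not None:
            results.append((elem.tag, elem.text, type(elem.text).__name__))
    return results


# Hypothetical sample: one ascii-only text node, one non-ascii text node.
SAMPLE = (u'<?xml version="1.0" encoding="utf-8"?>'
          u'<root><a>plain</a><b>caf\xe9</b></root>').encode('utf-8')

for tag, _text, type_name in text_types(SAMPLE):
    # Print only the tag and the type name to avoid console encoding issues.
    print("%s: %s" % (tag, type_name))
```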
It's interesting that the element text attributes after a successful
parse do not necessarily have the same type, i.e. all be str or all
unicode. I ported some text extraction code from BeautifulSoup (which
handles all text as unicode) and I was surprised to find out that in
xml.etree the returned text's type is not fixed, even within the same
file. Although it's not a bug, having a mixed collection of byte and
unicode strings from the same source makes me somewhat uneasy.
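
One possible workaround, sketched here as an assumption rather than as
anything xml.etree provides, is to normalize every text value to
unicode right after extracting it. The helper as_unicode is a made-up
name. It relies on the observation above that byte strings are only
returned for ascii-encodable text, so decoding them as ascii should be
safe; it also assumes an interpreter where the name bytes exists (on
Python 2.6 and later it is an alias for str).

```python
def as_unicode(text, encoding='ascii'):
    """Return text as a unicode string, decoding it if it is a byte string."""
    if text is None:
        # Elements without text have elem.text set to None; keep that as-is.
        return None
    if isinstance(text, bytes):
        # Byte strings are assumed to be ascii-only, per the behaviour above.
        return text.decode(encoding)
    return text
```

Applied as as_unicode(elem.text) at each point where text is read, this
gives the calling code a single string type to deal with.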