Stefan Behnel wrote:
> > a Python Unicode string doesn't contain bytes; it contains a sequence of Unicode code points, which are indexes into an abstract character space.
>
> Ok, so then that means that unicode strings are completely unparsable. A standards-compliant XML API should raise an error when it is asked to parse a sequence of unicode code points. Let's see...
>
>     >>> from elementtree.ElementTree import XML
>     >>> XML(u"<test/>")
>     <Element test at 2ad6c0771bd8>
>
> What? I didn't put any bytes in there? Where did the element come from?
the CPython interpreter uses a default encoding, and attempts to *encode* Unicode strings using this encoding when you pass them to an interface that expects bytes. if that doesn't work, the function won't even get called; instead, you'll get a "can't encode" exception:
    >>> XML(u"<föö/>")
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
      File "<string>", line 67, in XML
    UnicodeEncodeError: 'ascii' codec can't encode characters in position 2-3: ordinal not in range(128)
still think that XML supports Unicode ? or are you saying that the subset of Unicode that happens to be ASCII is a good enough subset ?
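for contrast, a minimal sketch of what happens when the parser gets bytes. this assumes a current Python 3 interpreter and the standard library's xml.etree.ElementTree (the descendant of the elementtree package used above); the variable names are made up for illustration and none of this is from the original session. the parser does the decoding itself, using the declared encoding or the UTF-8 default:

```python
import xml.etree.ElementTree as ET

# the same document as above, but as bytes, in two different encodings
utf8_doc = u"<föö/>".encode("utf-8")  # no declaration: XML defaults to UTF-8
latin1_doc = u'<?xml version="1.0" encoding="iso-8859-1"?><föö/>'.encode("iso-8859-1")

utf8_elem = ET.fromstring(utf8_doc)
latin1_elem = ET.fromstring(latin1_doc)

# different bytes on the way in, same code points in the resulting tree
print(utf8_doc != latin1_doc)
print(utf8_elem.tag == latin1_elem.tag == u"föö")
```

the encoding is a property of the byte stream, and it is gone by the time you look at the element.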
> > a Python Unicode string doesn't have an encoding.
>
> Well, it does, internally. And it's even well-defined across the whole platform.
that's an implementation detail. a Python implementation may use whatever representation it wants on the inside. on the outside, there's no encoding (in the traditional sense); all there is is a sequence of Unicode code points.
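to make the code-point view concrete, here is a small sketch. it assumes Python 3, where str is the Unicode string type and nothing is encoded implicitly, so the step that the default encoding performed behind the scenes in the session above has to be spelled out by hand:

```python
s = u"<föö/>"

# on the outside: six code points, whatever the interpreter stores internally
code_points = [ord(ch) for ch in s]
print(len(s), code_points)

# bytes only exist once you pick an encoding,
# and different encodings give different bytes
utf8_bytes = s.encode("utf-8")
latin1_bytes = s.encode("iso-8859-1")
print(len(utf8_bytes), len(latin1_bytes))

# the ASCII codec cannot represent U+00F6, so encoding fails
try:
    s.encode("ascii")
except UnicodeEncodeError as exc:
    failure = (exc.start, exc.end)  # half-open range of offending positions
    print("can't encode positions", exc.start, "to", exc.end - 1)
```

the same string gives eight bytes in one encoding and six in another, which is why asking for "the" encoding of a Unicode string doesn't lead anywhere useful.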
> > XML serialization is all about converting between the XML infoset (which contains sequences of abstract code points) and the XML file format (which contains bytes). an XML file is a bunch of bytes, not a bunch of code points. storing a bunch of bytes as a bunch of code points is simply not a very good idea, and is a great way to make people who don't understand Unicode write XML applications that will break when exposed to non-ASCII text.
>
> You're definitely the first to tell me that using unicode makes people write programs that break for non-ascii text...
using Unicode with interfaces that expect bytes will break, if the Unicode string contains the wrong things. for example,
    >>> XML(u"<föö/>")
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
      File "<string>", line 67, in XML
    UnicodeEncodeError: 'ascii' codec can't encode characters in position 2-3: ordinal not in range(128)
and
    >>> f = open("file", "wb")
    >>> f.write(u"föö")
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    UnicodeEncodeError: 'ascii' codec can't encode characters in position 1-2: ordinal not in range(128)
and so on. which means that
    f = open("file.xml", "wb")
    f.write(ET.tounicode(tree))
will sometimes work, and sometimes fail, and sometimes generate broken XML files, depending on the data. while
    f = open("file.xml", "wb")
    f.write(ET.tostring(tree))
will always do the right thing.

</F>
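the two write patterns can be sketched as follows, under a few assumptions: Python 3, the standard library's xml.etree.ElementTree, an io.BytesIO object standing in for the file opened with "wb", and tostring(..., encoding="unicode") standing in for tounicode(), which the standard library module does not provide. note that Python 3 is stricter than the interpreter in the sessions above: writing a str to a binary file fails every time, not only for non-ASCII data.

```python
import io
import xml.etree.ElementTree as ET

elem = ET.Element("test")
elem.text = u"föö"

# tostring() returns bytes; with the default encoding, non-ASCII text
# is written as character references, so the result is plain ASCII
data = ET.tostring(elem)
binary_file = io.BytesIO()  # stands in for open("file.xml", "wb")
binary_file.write(data)

# the bytes parse back to the same code points
roundtrip = ET.fromstring(binary_file.getvalue()).text

# encoding="unicode" returns a str (code points, no bytes yet)
text = ET.tostring(elem, encoding="unicode")
try:
    io.BytesIO().write(text)
    str_write_failed = False
except TypeError:
    str_write_failed = True  # Python 3 refuses to guess an encoding

print(type(data).__name__, type(text).__name__, str_write_failed)
```

if you do start from the str form, you have to pick an encoding yourself and make sure it agrees with whatever the XML declaration says.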