[lxml-dev] Re: Re: Re: Python unicode string support in lxml

May 10, 2006

Stefan Behnel wrote:

>> a Python Unicode string doesn't contain bytes; it contains a sequence of
>> Unicode code points, which are indexes into an abstract character space.
>
> Ok, so then that means that unicode strings are completely unparsable. A
> standards-compliant XML API should raise an error when it is asked to parse a
> sequence of unicode code points. Let's see...
>
> >>> from elementtree.ElementTree import XML
> >>> XML(u"<test/>")
>   <Element test at 2ad6c0771bd8>
>
> What? I didn't put any bytes in there? Where did the element come from?

the CPython interpreter uses a default encoding, and attempts to *encode*
Unicode strings using this encoding when you pass them to an interface that
expects bytes.

if that doesn't work, the function won't even get called; instead, you'll get
a "can't encode" exception:

>>> XML(u"<föö/>")
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "<string>", line 67, in XML
UnicodeEncodeError: 'ascii' codec can't encode characters in position 2-3: ordinal not in range(128)

still think that XML supports Unicode ?  or are you saying that the subset
of Unicode that happens to be ASCII is a good enough subset ?
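
the implicit encoding step described above can be written out explicitly. this is a minimal sketch, assuming Python 3 (where the interpreter no longer encodes behind your back); `to_ascii_bytes`, `ascii_only` and `failure` are hypothetical names used only for illustration:

```python
# Python 3 sketch: do by hand what Python 2 did implicitly when a
# Unicode string was passed to an interface that expects bytes.

def to_ascii_bytes(text):
    """Encode text with the 'ascii' codec (Python 2's default encoding)."""
    return text.encode("ascii")

# Every code point is below 128, so the encode step succeeds.
ascii_only = to_ascii_bytes("<test/>")

# U+00F6 (o with diaeresis) is outside ASCII, so the encode step fails
# before any parser would ever see the data.
failure = None
try:
    to_ascii_bytes("<f\u00f6\u00f6/>")
except UnicodeEncodeError as exc:
    failure = exc
```

`failure.start` and `failure.end` delimit the offending characters, which is where the "position 2-3" in the traceback comes from.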

>> a Python Unicode string doesn't have an encoding.
>
> Well, it does, internally. And it's even well-defined across the whole platform.

that's an implementation detail.  a Python implementation may use whatever
representation it wants on the inside.   on the outside, there's no encoding
(in the traditional sense); all there is is a sequence of Unicode code points.
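
to make the distinction concrete, here is a small sketch, assuming Python 3 (where `str` is the Unicode type). the variable names are made up for illustration. the same sequence of code points turns into different bytes depending on which encoding you pick, which is why the bytes are not a property of the string itself:

```python
# One sequence of code points, several possible byte representations.
text = "f\u00f6\u00f6"  # "f" followed by two U+00F6 characters

# What the string exposes on the outside: code points, not bytes.
code_points = [ord(ch) for ch in text]

# Bytes only appear once an encoding is chosen explicitly.
as_utf8 = text.encode("utf-8")       # two bytes per U+00F6
as_latin1 = text.encode("latin-1")   # one byte per U+00F6
as_utf16 = text.encode("utf-16-le")  # two bytes per code point here

# Decoding with the matching codec gives back the same code points.
round_trips = [
    as_utf8.decode("utf-8"),
    as_latin1.decode("latin-1"),
    as_utf16.decode("utf-16-le"),
]
```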

>> XML serialization is all about converting between the XML infoset (which
>> contains sequences of abstract code points) and the XML file format (which
>> contains bytes).  an XML file is a bunch of bytes, not a bunch of code
>> points.  storing a bunch of bytes as a bunch of code points is simply not
>> a very good idea, and is a great way to get people who don't understand
>> Unicode to write XML applications that will break when exposed to
>> non-ASCII text.
>
> You're definitely the first to tell me that using unicode makes people write
> programs that break for non-ascii text...

using Unicode with interfaces that expect bytes will break, if the Unicode
string contains the wrong things.  for example,

>>> XML(u"<föö/>")
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "<string>", line 67, in XML
UnicodeEncodeError: 'ascii' codec can't encode characters in position 2-3: ordinal not in range(128)

and

>>> f = open("file", "wb")
>>> f.write(u"föö")
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode characters in position 1-2: ordinal not in range(128)

and so on.  which means that

>>> f = open("file.xml", "wb")
>>> f.write(ET.tounicode(tree))

will sometimes work, and sometimes fail, and sometimes generate broken XML
files, depending on the data.  while

>>> f = open("file.xml", "wb")
>>> f.write(ET.tostring(tree))

will always do the right thing.
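
a rough modern equivalent, sketched under a few assumptions: Python 3, the standard library's `xml.etree.ElementTree` standing in for the `elementtree` package used above, an explicit `encoding="utf-8"` argument, and an in-memory `io.BytesIO` standing in for the file opened in "wb" mode. the variable names are hypothetical:

```python
import io
import xml.etree.ElementTree as ET

# A tiny tree whose text contains non-ASCII characters.
root = ET.Element("test")
root.text = "f\u00f6\u00f6"

# Serializing to bytes: the encoding is decided by the serializer.
xml_bytes = ET.tostring(root, encoding="utf-8")

# Serializing to text: a str of code points, with no encoding applied yet.
xml_text = ET.tostring(root, encoding="unicode")

# Stand-in for open("file.xml", "wb").
sink = io.BytesIO()
sink.write(xml_bytes)  # bytes go into a byte stream without any guessing

# Python 3 refuses to guess an encoding, so the text variant is rejected.
text_write_error = None
try:
    sink.write(xml_text)
except TypeError as exc:
    text_write_error = exc

# The bytes that were written parse back to the same content.
reparsed = ET.fromstring(sink.getvalue())
```

in Python 3 the text variant fails every time instead of only for some data, so the mistake shows up on the first test run rather than on the first non-ASCII document.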

</F>

Fredrik Lundh