[XML-SIG] Re: Issues with Unicode type
Uche Ogbuji
uche.ogbuji@fourthought.com
Mon, 23 Sep 2002 11:31:51 -0600
> > <?xml version="1.0" encoding="utf-8"?>
> > <doc>𐠀</doc>
> >
> > and the length of the text node of the doc element is supposed to be 1
> > instead of 2 as expected by my (naive) implementation of the length
> > facet.
> >
> > What makes me think that it could be a generic issue with python is the
> > following (kindly contributed by Uche):
> >
> > <uche> >>> hex(67584)
> > <uche> '0x10800'
> > <uche> >>> c = u"\u10800"
> > <uche> >>> c
> > <uche> u'\u10800'
> > <uche> >>> len(c)
> > <uche> 2
>
> By default Python is using UTF-16 as its Unicode encoding. The
> code-point that you specify, U+10800, is outside the BMP and hence is
> represented by two surrogate characters in UTF-16.
>
> If you were to recompile your Python installation to use UTF-32 as the
> Unicode character type then I expect that you will get the length you
> expect.
>
> Consider:
>
> >>> c= u"\u4e00"
> >>> c
> u'\u4e00'
> >>> len(c)
> 1

Hmm. I'm going to open my mouth and show off my ignorance now. I should
probably spend some time with my Tony Graham before ever posting on Unicode,
but I don't have the time right now, and besides, there is no better way to
get Eric an answer than to say something wrong that has to be corrected by one
of the many Unicode gurus who I know hang around here :-)

IIRC, UTF-16 supports the representation of characters outside the BMP by
using surrogate pairs (SP). If so, then the scary solution of requiring XML
users to compile Python to use UCS-4 can be put aside.

The question would then be how to get a surrogate pair into a Python unicode
object. On a hunch, I tried:

>>> c = u"\uD800\uDC00"
>>> len(c)
2

So I guess the answer isn't just using the literal characters in the SP.
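
For what it's worth, here is a rough sketch of how a length facet could count a
surrogate pair as one character by walking the UTF-16 code units itself. The
helper names (utf16_units, count_code_points, combine_surrogates) are made up
for illustration and are not part of Python or any XML library, so treat this
as an assumption-laden sketch rather than a fix:

```python
import struct

def utf16_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-be")
    return list(struct.unpack(">%dH" % (len(data) // 2), data))

def count_code_points(units):
    """Count characters, treating a high surrogate that is immediately
    followed by a low surrogate as a single character."""
    count = 0
    i = 0
    while i < len(units):
        is_pair = (0xD800 <= units[i] <= 0xDBFF
                   and i + 1 < len(units)
                   and 0xDC00 <= units[i + 1] <= 0xDFFF)
        i += 2 if is_pair else 1
        count += 1
    return count

def combine_surrogates(high, low):
    """Map a (high, low) surrogate pair to the code point it encodes."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

# The eight-digit \U escape is used here because \u takes exactly four
# hex digits, so "\u10800" would be U+1080 followed by the digit "0".
units = utf16_units(u"\U00010800")
n_chars = count_code_points(units)
code_point = combine_surrogates(units[0], units[1])
```

With this, the two code units for U+10800 come out as one character, and the
pair I tried above (D800, DC00) maps to U+10000.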
--
Uche Ogbuji Fourthought, Inc.
http://uche.ogbuji.net http://4Suite.org http://fourthought.com
Apache 2.0 API - http://www-106.ibm.com/developerworks/linux/library/l-apache/
Python&XML column: Tour of Python/XML - http://www.xml.com/pub/a/2002/09/18/py.html
Python/Web Services column: xmlrpclib - http://www-106.ibm.com/developerworks/webservices/library/ws-pyth10.html