[XML-SIG] Re: Issues with Unicode type

Tom Emerson tree@basistech.com
Mon, 23 Sep 2002 12:14:00 -0400


> <?xml version="1.0" encoding="utf-8"?>
>       <doc>&#67584;</doc>
> 
> and the length of the text node of the doc element is supposed to be 1
> instead of 2 as expected by my (naive) implementation of the length
> facet.
> 
> What makes me think that it could be a generic issue with python is the
> following (kindly contributed by Uche):
> 
> <uche> >>> hex(67584)
> <uche> '0x10800'
> <uche> >>> c = u"\u10800"
> <uche> >>> c
> <uche> u'\u10800'
> <uche> >>> len(c)
> <uche> 2

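By default, Python stores Unicode strings as UTF-16 code units. The
code point you specify, U+10800, lies outside the Basic Multilingual
Plane (BMP) and is therefore represented in UTF-16 by a surrogate pair,
that is, two 16-bit code units. len() counts code units, not code
points, which is why you see 2.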
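
As a rough illustration of the surrogate arithmetic, here is a small
sketch. The helper names utf16_units and surrogate_pair are made up for
this example and are not part of Python. Note also that the \u escape
takes exactly four hex digits, so u"\u10800" in the quoted session is
really U+1080 followed by the character "0"; writing U+10800 needs the
eight-digit form \U00010800.

```python
# Sketch only: utf16_units and surrogate_pair are hypothetical helpers,
# not standard library functions.

def utf16_units(text):
    """Return the number of 16-bit code units needed for text in UTF-16."""
    # "utf-16-le" emits no byte order mark, so every 2 bytes is one unit.
    return len(text.encode("utf-16-le")) // 2

def surrogate_pair(code_point):
    """Split a code point above U+FFFF into its (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)

outside_bmp = u"\U00010800"   # \U takes eight hex digits
inside_bmp = u"\u4e00"        # \u takes exactly four hex digits

print(utf16_units(outside_bmp))                    # units for U+10800
print(utf16_units(inside_bmp))                     # units for U+4E00
print([hex(u) for u in surrogate_pair(0x10800)])   # the two surrogates
print(u"\u10800" == u"\u1080" + u"0")              # four-digit escape check
```

Counting code units through an explicit encode gives the same answer
regardless of how the interpreter stores strings internally.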

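If you were to rebuild your Python installation with a 32-bit Unicode
character type (UCS-4, via the --enable-unicode=ucs4 configure option),
each code point would occupy a single unit, and I expect you would get
the length of 1 you are looking for.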

Consider a character inside the BMP, which fits in a single code unit:

>>> c = u"\u4e00"
>>> c
u'\u4e00'
>>> len(c)
1
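
If you are unsure which kind of build you are running, sys.maxunicode
should tell you. The following is a minimal sketch; unicode_build is a
hypothetical helper name, and the "narrow"/"wide" labels are just the
usual informal terms for 16-bit and 32-bit builds. Interpreters from
Python 3.3 onward no longer make this distinction and always report
0x10FFFF.

```python
import sys

def unicode_build():
    """Classify the interpreter by its largest representable code point.

    Hypothetical helper for illustration only.
    """
    if sys.maxunicode == 0xFFFF:
        return "narrow"   # 16-bit units; non-BMP characters need two
    return "wide"         # every code point is a single unit

print(unicode_build())
print(hex(sys.maxunicode))
# 2 on a narrow build, 1 on a wide build
print(len(u"\U00010800"))
```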

-- 
Tom Emerson                                          Basis Technology Corp.
Software Architect                                 http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"