[XML-SIG] Re: Issues with Unicode type
Tom Emerson
tree@basistech.com
Mon, 23 Sep 2002 12:14:00 -0400
> <?xml version="1.0" encoding="utf-8"?>
> <doc>𐠀</doc>
>
> and the length of the text node of the doc element is supposed to be 1
> instead of 2 as expected by my (naive) implementation of the length
> facet.
>
> What makes me think that it could be a generic issue with python is the
> following (kindly contributed by Uche):
>
> <uche> >>> hex(67584)
> <uche> '0x10800'
> <uche> >>> c = u"\u10800"
> <uche> >>> c
> <uche> u'\u10800'
> <uche> >>> len(c)
> <uche> 2
By default, Python uses UTF-16 as its internal Unicode representation.
The code point you specify, U+10800, lies outside the BMP and is
therefore represented by a surrogate pair (two 16-bit code units) in
UTF-16, and len() counts code units rather than code points.
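To make the surrogate arithmetic concrete, here is a minimal sketch of how a code point above U+FFFF splits into two UTF-16 code units. The helper name to_surrogate_pair is mine, purely for illustration, and the sketch assumes a current Python 3 interpreter rather than the 2.x build discussed in this thread.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary code point into its UTF-16 high and low surrogates.

    Hypothetical helper for illustration only; not part of any library.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point: %#x" % code_point)
    offset = code_point - 0x10000        # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits select the high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits select the low surrogate
    return high, low


high, low = to_surrogate_pair(0x10800)
print(hex(high), hex(low))  # 0xd802 0xdc00
```

So U+10800 occupies the two code units D802 DC00, which is where a length of 2 comes from on a UTF-16 build.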
If you recompile your Python installation to use UTF-32 (UCS-4) as the
Unicode character type, I expect you will get the length you are
looking for.
Consider, by contrast, a code point inside the BMP:
>>> c = u"\u4e00"
>>> c
u'\u4e00'
>>> len(c)
1
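For the length facet itself, one option is to count code points explicitly instead of relying on len(). The following is only a sketch under my own assumptions: the function names are hypothetical, it is written for a current Python 3 interpreter, and it treats a well-formed surrogate pair as a single character so that it gives the same answer whether or not the string stores surrogates. One caveat about the quoted session: the \u escape takes exactly four hex digits, so u"\u10800" is really U+1080 followed by the digit "0"; the eight-digit form \U00010800 is what names U+10800.

```python
def utf16_length(text):
    """Number of UTF-16 code units needed to encode text (hypothetical helper)."""
    return len(text.encode("utf-16-le")) // 2


def code_point_length(text):
    """Number of code points, counting a surrogate pair as one character."""
    count = 0
    i = 0
    while i < len(text):
        unit = ord(text[i])
        is_high = 0xD800 <= unit <= 0xDBFF
        # A high surrogate followed by a low surrogate forms one code point.
        if is_high and i + 1 < len(text) and 0xDC00 <= ord(text[i + 1]) <= 0xDFFF:
            i += 2
        else:
            i += 1
        count += 1
    return count


astral = "\U00010800"   # U+10800, outside the BMP
bmp = "\u4e00"          # U+4E00, inside the BMP

print(code_point_length(astral), utf16_length(astral))  # 1 2
print(code_point_length(bmp), utf16_length(bmp))        # 1 1
```

A length facet built on code_point_length would report 1 for the doc element above, independent of how the interpreter stores the string.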
--
Tom Emerson Basis Technology Corp.
Software Architect http://www.basistech.com
"Beware the lollipop of mediocrity: lick it once and you suck forever"