[XML-SIG] Re: Issues with Unicode type
Tom Emerson
tree@basistech.com
Mon, 23 Sep 2002 13:58:49 -0400
Uche Ogbuji writes:
> IIRC, UTF-16 supports the representation of characters outside the BMP by
> using surrogate pairs (SP). If so, then the scary solution of requiring XML
> users to compile Python to use UCS-4 can be put aside.
Yes, that is what I (thought I) said in my previous response: since
internally Python represents a character outside the BMP as a
surrogate pair in UTF-16, the length of a Unicode string holding one
such character is 2, i.e., two UTF-16 code units.
> The question would then be how to get a surrogate pair into a Python unicode
> object. On a hunch, I tried:
>
> >>> c = u"\uD800\uDC00"
> >>> len(c)
> 2
That works. You can also use \U notation:
>>> c = u"\U00010000"
>>> len(c)
2
>>> c
u'\U00010000'
>>> c[0]
u'\ud800'
>>> c[1]
u'\udc00'
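
For illustration, here is a minimal sketch of the arithmetic that maps a
code point outside the BMP to the two code units shown above. The
function name to_surrogate_pair is hypothetical (it is not part of
Python's library), and the sketch is written for a current Python 3
interpreter rather than the Python 2 sessions quoted in this thread.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary-plane code point into UTF-16 surrogates.

    Hypothetical helper for illustration only.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits -> low surrogate
    return high, low


high, low = to_surrogate_pair(0x10000)
print(hex(high), hex(low))  # 0xd800 0xdc00
```

For U+10000 the offset is zero, which is why the pair is exactly
D800/DC00.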
If you compile your Python installation to use "wide" Unicode
characters (i.e., UCS-4), then I expect the behavior to be:
>>> c = u"\U00010000"
>>> len(c)
1
>>> c
u'\U00010000'
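
Since len() depends on how the interpreter was built, code that needs
the UTF-16 length regardless of build could count code units
explicitly. The following is a rough sketch, assuming Python 3.3 or
later (where strings are always indexed by code point); the helper
names utf16_length and is_wide_build are hypothetical.

```python
import sys


def is_wide_build():
    """True if this interpreter can index code points above U+FFFF directly."""
    # Narrow builds report 0xFFFF here; wide builds (and all of
    # Python 3.3+) report 0x10FFFF.
    return sys.maxunicode > 0xFFFF


def utf16_length(text):
    """Number of UTF-16 code units needed to encode text."""
    # "utf-16-le" emits no byte order mark, so every 2 bytes is one code unit.
    return len(text.encode("utf-16-le")) // 2


c = "\U00010000"
print(is_wide_build())
print(len(c), utf16_length(c))
```

On such an interpreter len(c) counts code points while utf16_length(c)
counts what a narrow build would have reported.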
--
Tom Emerson Basis Technology Corp.
Software Architect http://www.basistech.com
"Beware the lollipop of mediocrity: lick it once and you suck forever"