[XML-SIG] Re: Issues with Unicode type

Tom Emerson tree@basistech.com
Mon, 23 Sep 2002 13:58:49 -0400


Uche Ogbuji writes:
> IIRC, UTF-16 supports the representation of characters outside the BMP by 
> using surrogate pairs (SP).  If so, then the scary solution of requiring XML 
> users to compile Python to use UCS-4 can be put aside.

Yes, that is what I (thought I) said in my previous response: since
Python internally represents a character outside the BMP as a UTF-16
surrogate pair, the length of a Unicode string holding one such
character is 2 (two UTF-16 code units, not two characters).

> The question would then be how to get a surrogate pair into a Python unicode 
> object.  On a hunch, I tried:
> 
> >>> c = u"\uD800\uDC00"
> >>> len(c)
> 2

That works. You can also use \U notation:

>>> c = u"\U00010000"
>>> len(c)
2
>>> c
u'\U00010000'
>>> c[0]
u'\ud800'
>>> c[1]
u'\udc00'
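
The two code units in that session follow from the standard UTF-16
surrogate arithmetic. A minimal sketch of that calculation, where
to_surrogate_pair is a hypothetical helper name used only for
illustration and not part of Python:

```python
def to_surrogate_pair(code_point):
    """Split a code point above the BMP into UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000       # 20-bit value to spread over the pair
    high = 0xD800 + (offset >> 10)      # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits go into the low surrogate
    return high, low

high, low = to_surrogate_pair(0x10000)
print("%04x %04x" % (high, low))        # d800 dc00
```

For U+10000 the offset is zero, which is why the pair is exactly the
first high surrogate followed by the first low surrogate.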

If you compile your Python installation to use "wide" Unicode
characters (i.e., UCS-4, which is equivalent to UTF-32), then I expect
the behavior to be:

>>> c = u"\U00010000"
>>> len(c)
1
>>> c
u'\U00010000'
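
Code that has to cope with either kind of build can check which one it
is running on rather than relying on len(). A rough sketch, assuming
sys.maxunicode is available (it reports the largest code point a
single string element can hold); utf16_code_units is a hypothetical
helper name, not an existing function:

```python
import sys

def utf16_code_units(text):
    """Count the UTF-16 code units needed for text, whatever the build."""
    # Each UTF-16 code unit is two bytes; the -be codec adds no BOM.
    return len(text.encode("utf-16-be")) // 2

c = u"\U00010000"
is_wide = sys.maxunicode > 0xFFFF       # a narrow build reports 0xFFFF
print("wide=%s len=%d utf16_units=%d" % (is_wide, len(c), utf16_code_units(c)))
```

The code unit count stays at 2 on both kinds of build, while len(c)
is the value that changes between them.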

-- 
Tom Emerson                                          Basis Technology Corp.
Software Architect                                 http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"