[XML-SIG] Re: Issues with Unicode type

23 Sep 2002 13:27:22 -0600

On Mon, 2002-09-23 at 11:58, Tom Emerson wrote:
> Uche Ogbuji writes:
> > IIRC, UTF-16 supports the representation of characters outside the BMP by 
> > using surrogate pairs (SP).  If so, then the scary solution of requiring XML 
> > users to compile Python to use UCS-4 can be put aside.
> 
> Yes, that is what I (thought I) said in my previous response: since
> internally Python is representing characters outside the BMP as a
> surrogate pair in UTF-16, the length of a Unicode string using these
> characters is 2 --- two UTF-16 characters.

No.  A surrogate pair is one character.  It takes up 2 16-bit values,
but this is not the same as taking up 2 characters.  The whole point of
a variable-length encoding such as UTF-16 is that the number of storage
values is not always the same as the number of characters.
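
To make the storage-value versus character distinction concrete, here is a minimal sketch. It assumes a modern Python 3 interpreter (3.3 or later), where len() counts code points regardless of build; the helper names utf16_code_units and surrogate_pair are illustrative only and are not part of any library.

```python
def utf16_code_units(text):
    """Return how many 16-bit storage values UTF-16 needs for text."""
    # "utf-16-le" emits no byte-order mark, so every 2 bytes is one code unit.
    return len(text.encode("utf-16-le")) // 2


def surrogate_pair(code_point):
    """Split a code point outside the BMP into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in the supplementary planes")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


c = "\U00010000"
print(len(c))                                      # number of characters
print(utf16_code_units(c))                         # number of 16-bit values
print([hex(u) for u in surrogate_pair(ord(c))])    # the two storage values
```

On such an interpreter the first count is 1 and the second is 2: one character, two 16-bit values.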

Eric found this message where Guido does a decent job of summarizing the
various issues, though I'm not sure I agree with his conclusion:

http://mail.python.org/pipermail/i18n-sig/2001-June/001107.html

I should note that, based on what Eric found in James Clark's code, Java
doesn't treat surrogates specially internally, either, which I guess
tends to bolster Guido's POV  :-(

> > The question would then be how to get a surrogate pair into a Python unicode 
> > object.  On a hunch, I tried:
> > 
> > >>> c = u"\uD800\uDC00"
> > >>> len(c)
> > 2
> 
> That works. You can also use \U notation:

No.  My whole point is that it didn't work.  len(c) would be 1, not 2, if
the characters were properly treated as a surrogate pair.
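
As a rough sketch of what "properly treated" could mean for a length function, the following counts characters in a sequence of UTF-16 storage values, treating a high surrogate followed by a low surrogate as one character. The function name count_code_points is hypothetical, and counting a lone surrogate as a single character is an assumption of this sketch, not something the thread settles.

```python
def count_code_points(code_units):
    """Count characters in a sequence of 16-bit UTF-16 values."""
    count = 0
    i = 0
    n = len(code_units)
    while i < n:
        unit = code_units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        has_low = i + 1 < n and 0xDC00 <= code_units[i + 1] <= 0xDFFF
        # A well-formed pair consumes two storage values but is one character.
        i += 2 if (is_high and has_low) else 1
        count += 1
    return count


print(count_code_points([0xD800, 0xDC00]))              # the pair from above
print(count_code_points([0x41, 0xD800, 0xDC00, 0x42]))  # "A", U+10000, "B"
```

With this definition the pair D800 DC00 has length 1, which is the result I was expecting from len(c).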

> >>> c = u"\U00010000"
> >>> len(c)
> 2
> >>> c
> u'\U00010000'
> >>> c[0]
> u'\ud800'
> >>> c[1]
> u'\udc00'
> 
> If you compile your Python installation to use "wide" Unicode
> characters (i.e., UTF-32), then I expect the behavior to be
> 
> >>> c = u"\U00010000"
> >>> len(c)
> 1
> >>> c
> u'\U00010000'

Yes.  Don't you see that this means that the behavior as compiled with
UTF-16 is wrong from a *character set* point of view?  The same code
point is *one* character whether encoded in UTF-7, UTF-8, UTF-16,
UTF-32, UCS-2, UCS-4, etc.  It is never more than one character.
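
A small sketch of that claim, again assuming Python 3.3 or later (where decoded strings are counted in code points): the same two-character string is encoded three ways, and only the byte count changes. The helper name sizes_by_encoding and the choice of codecs are illustrative assumptions.

```python
def sizes_by_encoding(text, encodings=("utf-8", "utf-16-le", "utf-32-le")):
    """Map each encoding name to (bytes used, characters after decoding)."""
    result = {}
    for name in encodings:
        data = text.encode(name)
        result[name] = (len(data), len(data.decode(name)))
    return result


sample = "a\U00010000"  # one BMP character plus one character outside the BMP
for name, (n_bytes, n_chars) in sizes_by_encoding(sample).items():
    print(name, n_bytes, n_chars)
```

The byte counts differ (5, 6 and 8 respectively), while the character count is 2 in every case.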

-- 
Uche Ogbuji                                    Fourthought, Inc.
http://uche.ogbuji.net    http://4Suite.org    http://fourthought.com
Apache 2.0 API -
http://www-106.ibm.com/developerworks/linux/library/l-apache/
Python&XML column: Tour of Python/XML -
http://www.xml.com/pub/a/2002/09/18/py.html
Python/Web Services column: xmlrpclib -
http://www-106.ibm.com/developerworks/webservices/library/ws-pyth10.html