[XML-SIG] Re: Issues with Unicode type

Mon, 23 Sep 2002 15:29:06 -0400

Uche Ogbuji writes:
> No.  A surrogate pair is one character.  It takes up 2 16-bit values,
> but this is not the same as taking up 2 characters.  The whole point of
> a variable-length encoding such as UTF-16 is that the number of storage
> values is not always the same as the number of characters.

Yes, I'm aware of that. The problem is that I was being sloppy in my
use of the word 'character'.
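
To pin down the terminology, here is a minimal sketch of how one code
point above U+FFFF maps onto two 16-bit values in UTF-16. The function
name to_surrogate_pair is made up for illustration; it is not part of
Python or of any library.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into its UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in the astral planes")
    offset = code_point - 0x10000        # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits -> low surrogate
    return high, low


# U+1D11E (MUSICAL SYMBOL G CLEF) is one character, two 16-bit values.
high, low = to_surrogate_pair(0x1D11E)
print("%04X %04X" % (high, low))  # prints "D834 DD1E"
```

One character goes in and two storage values come out, which is the
distinction I was blurring.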

> Yes.  Don't you see that this means that the behavior as compiled with
> UTF-16 is wrong from a *character set* point of view?  The same code
> point is *one* character whether encoded in UTF-7, UTF-8, UTF-16,
> UTF-32, UCS-2, UCS-4, etc.  It is never more than one character.

Sure, but the *implementation* within the Python interpreter treats
characters in the astral planes as two 16-bit words, not one. The
len() value that you get is the number of 16-bit UTF-16 code units in
the string, not the number of characters. There was a very long, very
drawn-out discussion of the representation of Unicode characters in
Python a while back on the python-i18n mailing list, where this whole
thing was beaten to death, and which eventually led to the option to
compile the interpreter with a 32-bit character representation.
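
The two counts can be compared without depending on how the
interpreter was compiled. This is only a rough sketch: the helper
names utf16_code_units and code_points are invented for this example,
and it assumes a Python recent enough to have the "utf-32-le" codec
and the \U escape in plain string literals. It goes through the codecs
rather than len(), so it shows the number a 16-bit build reports next
to the number of actual characters.

```python
def utf16_code_units(text):
    """Count 16-bit units, i.e. what len() reports on a 16-bit build."""
    return len(text.encode("utf-16-le")) // 2


def code_points(text):
    """Count characters (code points), independent of internal storage."""
    return len(text.encode("utf-32-le")) // 4


# One BMP character followed by one astral-plane character.
s = "a\U0001D11E"
print(utf16_code_units(s))  # prints 3
print(code_points(s))       # prints 2
```

The difference between the two results is exactly the number of
surrogate pairs in the string.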

> 
> -- 
> Uche Ogbuji                                    Fourthought, Inc.
> http://uche.ogbuji.net    http://4Suite.org    http://fourthought.com
> Apache 2.0 API -
> http://www-106.ibm.com/developerworks/linux/library/l-apache/
> Python&XML column: Tour of Python/XML -
> http://www.xml.com/pub/a/2002/09/18/py.html
> Python/Web Services column: xmlrpclib -
> http://www-106.ibm.com/developerworks/webservices/library/ws-pyth10.html

-- 
Tom Emerson                                          Basis Technology Corp.
Software Architect                                 http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"