[XML-SIG] Re: Issues with Unicode type
Uche Ogbuji
uche.ogbuji@fourthought.com
Mon, 23 Sep 2002 14:04:02 -0600
> Uche Ogbuji writes:
> > No. A surrogate pair is one character. It takes up 2 16-bit values,
> > but this is not the same as taking up 2 characters. The whole point of
> > a variable-length encoding such as UTF-16 is that the number of storage
> > values is not always the same as the number of characters.
>
> Yes, I'm aware of that. The problem is one of me being sloppy in the
> use of the word 'character'.
Ah. I wasn't meaning to leap too hard on that. I thought we had a genuine
misunderstanding on this.
> > Yes. Don't you see that this means that the behavior as compiled with
> > UTF-16 is wrong from a *character set* point of view? The same code
> > point is *one* character whether encoded in UTF-7, UTF-8, UTF-16,
> > UTF-32, UCS-2, UCS-4, etc. It is never more than one character.
>
> Sure, but the *implementation* within the Python interpreter is
> treating characters in the astral planes as two 16-bit words, not
> one. The len() value that you get is the number of UTF-16-encoded
> words in the string. There was a very long, very drawn out discussion
> on the representation of Unicode characters in Python a while back on
> the python-i18n mailing list where this whole thing was beaten to
> death and which eventually led to the option to compile the
> interpreter to use a 32-bit character representation.
Yes. I'm learning about all this, and learning a lot that I would probably
have preferred to be blissfully ignorant of :-(
Thanks.
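
For the archive, here is a rough sketch of the distinction discussed above:
one code point is one character, however many storage units an encoding
needs for it. It assumes Python 3.3 or later, where len() on a string counts
code points; on the UTF-16 builds discussed in this thread, len() would
report 2 for the same string. The helper names code_units and surrogate_pair
are made up for illustration and are not part of any library.

```python
# Assumes Python 3.3+, where str is indexed by code point, not by 16-bit word.
ch = "\U00010400"  # U+10400, a code point outside the Basic Multilingual Plane


def code_units(text, encoding, unit_size):
    """Count storage units of unit_size bytes in the encoded text (hypothetical helper)."""
    return len(text.encode(encoding)) // unit_size


def surrogate_pair(code_point):
    """Split a code point above U+FFFF into its UTF-16 high and low surrogates."""
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


characters = len(ch)                           # number of code points
utf8_units = code_units(ch, "utf-8", 1)        # 8-bit units
utf16_units = code_units(ch, "utf-16-le", 2)   # 16-bit units, no BOM
utf32_units = code_units(ch, "utf-32-le", 4)   # 32-bit units, no BOM
high, low = surrogate_pair(ord(ch))

print(characters, utf8_units, utf16_units, utf32_units)  # 1 4 2 1
print(hex(high), hex(low))                               # 0xd801 0xdc00
```

The explicit little-endian codecs are used only so that no byte order mark
is counted as an extra unit.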
--
Uche Ogbuji Fourthought, Inc.
http://uche.ogbuji.net http://4Suite.org http://fourthought.com
Apache 2.0 API - http://www-106.ibm.com/developerworks/linux/library/l-apache/
Python&XML column: Tour of Python/XML - http://www.xml.com/pub/a/2002/09/18/py.html
Python/Web Services column: xmlrpclib - http://www-106.ibm.com/developerworks/webservices/library/ws-pyth10.html