[XML-SIG] Re: Issues with Unicode type
Uche Ogbuji
uche.ogbuji@fourthought.com
23 Sep 2002 13:27:22 -0600
On Mon, 2002-09-23 at 11:58, Tom Emerson wrote:
> Uche Ogbuji writes:
> > IIRC, UTF-16 supports the representation of characters outside the BMP by
> > using surrogate pairs (SP). If so, then the scary solution of requiring XML
> > users to compile Python to use UCS-4 can be put aside.
>
> Yes, that is what I (thought I) said in my previous response: since
> internally Python is representing characters outside the BMP as a
> surrogate pair in UTF-16, the length of a Unicode string using these
> characters is 2 --- two UTF-16 characters.
No.  A surrogate pair is one character.  It occupies two 16-bit code
units, but that is not the same as being two characters.  The whole
point of a variable-length encoding such as UTF-16 is that the number of
storage units is not always the same as the number of characters.
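To make the distinction concrete, here is a minimal sketch. It assumes a modern Python (3.3 or later), where len() on a string counts code points regardless of build; that is not the interpreter under discussion in this thread. The names `ch` and `code_units` are only illustrative.

```python
# Assumption: Python 3.3+, where len() on a str counts code points.
ch = "\U00010000"  # first code point beyond the BMP

# Encode without a byte-order mark so that every 2 bytes is one code unit.
utf16_bytes = ch.encode("utf-16-be")
code_units = [int.from_bytes(utf16_bytes[i:i + 2], "big")
              for i in range(0, len(utf16_bytes), 2)]

print(len(ch))                       # number of characters (code points)
print([hex(u) for u in code_units])  # the 16-bit storage values
```

The string holds one character, while its UTF-16 form holds two storage values (a high and a low surrogate).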
Eric found this message where Guido does a decent job of summarizing the
various issues, though I'm not sure I agree with his conclusion:
http://mail.python.org/pipermail/i18n-sig/2001-June/001107.html
I should note that, judging by what Eric found in James Clark's code,
Java doesn't treat surrogates specially internally either, which I guess
tends to bolster Guido's POV :-(
> > The question would then be how to get a surrogate pair into a Python unicode
> > object. On a hunch, I tried:
> >
> > >>> c = u"\uD800\uDC00"
> > >>> len(c)
> > 2
>
> That works. You can also use \U notation:
No.  My whole point is that it didn't work.  len(c) would be 1, not 2,
if the two values were properly treated as a surrogate pair.
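As a rough illustration of what "properly treated" would mean for a length calculation, here is a sketch of a character count over raw 16-bit code units. The helper name `count_code_points` is hypothetical, not anything in Python's library, and counting an unpaired surrogate as one character is my own assumption for the sketch.

```python
def count_code_points(units):
    """Count characters in a sequence of 16-bit UTF-16 code units.

    Hypothetical helper: a high surrogate followed by a low surrogate
    counts as one character; an unpaired surrogate is counted as one too.
    """
    count = 0
    i = 0
    while i < len(units):
        is_high = 0xD800 <= units[i] <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        # Skip both halves of a well-formed pair, otherwise one unit.
        i += 2 if (is_high and has_low) else 1
        count += 1
    return count


print(count_code_points([0xD800, 0xDC00]))
```

Under this rule the two code units from the example above count as a single character.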
> >>> c = u"\U00010000"
> >>> len(c)
> 2
> >>> c
> u'\U00010000'
> >>> c[0]
> u'\ud800'
> >>> c[1]
> u'\udc00'
>
> If you compile your Python installation to use "wide" Unicode
> characters (i.e., UTF-32), then I expect the behavior to be
>
> >>> c = u"\U00010000"
> >>> len(c)
> 1
> >>> c
> u'\U00010000'
Yes. Don't you see that this means that the behavior as compiled with
UTF-16 is wrong from a *character set* point of view? The same code
point is *one* character whether encoded in UTF-7, UTF-8, UTF-16,
UTF-32, UCS-2, UCS-4, etc. It is never more than one character.
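A small sketch of that claim, again assuming Python 3.3 or later rather than the builds discussed here: the number of storage units differs from one encoding to the next, but decoding always yields a single character. The `results` dictionary is just an illustrative name.

```python
# Assumption: Python 3.3+, where a decoded str is a sequence of code points.
ch = "\U00010000"

results = {}
for name, unit_size in (("utf-8", 1), ("utf-16-le", 2), ("utf-32-le", 4)):
    data = ch.encode(name)
    storage_units = len(data) // unit_size  # code units in this encoding
    characters = len(data.decode(name))     # characters after decoding
    results[name] = (storage_units, characters)

for name, (storage_units, characters) in results.items():
    print(name, storage_units, characters)
```

The storage-unit count changes with the encoding; the character count does not.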
--
Uche Ogbuji Fourthought, Inc.
http://uche.ogbuji.net http://4Suite.org http://fourthought.com