[XML-SIG] Re: Issues with Unicode type
Uche Ogbuji
uche.ogbuji@fourthought.com
Mon, 23 Sep 2002 16:41:58 -0600
> However that is really two characters 0x1080 and 0x0030. \u (lowercase)
> only takes 4 hex digits. \U (uppercase) takes 8 digits. So to create the
> character 0x10800, the sequence should be u'\U00010800'.
Right, Jeremy. I wasn't squinting hard enough at Daniel's example. In my own
examples, I've been using
u"\U00010000"
or
u"\uD800\uDC00"
These are equivalent if Python is compiled to store Unicode as UTF-16: in
the first form, Python breaks the full code point into its UTF-16 surrogate
pair, and so ends up with the same internal object as the second form.
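For illustration, the split into surrogates is plain arithmetic. The sketch
below is not Python's internal code; to_surrogate_pair is a hypothetical
helper name, and it assumes the input is a code point above U+FFFF.

```python
def to_surrogate_pair(code_point):
    """Return the (high, low) UTF-16 surrogates for a supplementary code point.

    Hypothetical helper for illustration only, not part of any library.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point: %#x" % code_point)
    offset = code_point - 0x10000      # 20 significant bits remain
    high = 0xD800 + (offset >> 10)     # top 10 bits select the high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits select the low surrogate
    return high, low


high, low = to_surrogate_pair(0x10000)
print("%04X %04X" % (high, low))       # prints "D800 DC00"
```

Applied to U+10000 this gives D800 and DC00, the same two values spelled out
by hand in the second literal.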
I'm not sure whether they would be equivalent if Python is compiled for UCS-4
(BTW, there is no difference between UTF-32 and UCS-4, is there?). I would
imagine Python would blindly create two pseudo code points, D800 and DC00. I
say "pseudo" because these values lie in the surrogate blocks and so are not
valid characters in themselves.
Which leads me to believe that, even though u"\uD800\uDC00" is treated the
same as u"\U00010000" as long as Python is compiled for UTF-16, it is a
*very* bad idea to write Unicode literals that way.
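One way to see which case a given interpreter falls into is to compare the
two literals directly. This is only a sketch: it assumes sys.maxunicode is
0xFFFF on a UTF-16 build and larger otherwise, and compare_forms is a
hypothetical name. No expected results are listed, since they depend on how
the interpreter was built.

```python
import sys


def compare_forms():
    """Compare the full-code-point literal with the hand-written surrogates.

    Hypothetical helper for illustration only.
    """
    full = u"\U00010000"       # one code point, written with the 8-digit escape
    pair = u"\uD800\uDC00"     # two 4-digit escapes from the surrogate blocks
    return {
        # Assumption: a UTF-16 ("narrow") build reports 0xFFFF here.
        "utf16_build": sys.maxunicode == 0xFFFF,
        "len_full": len(full),
        "len_pair": len(pair),
        "equal": full == pair,
    }


result = compare_forms()
for key in sorted(result):
    print("%s: %r" % (key, result[key]))
```

If "equal" comes out false, the surrogate spelling has produced a different
string from the one intended, which is the portability problem described
above.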
I'm learning a lot today :-)
--
Uche Ogbuji Fourthought, Inc.
http://uche.ogbuji.net http://4Suite.org http://fourthought.com
Apache 2.0 API - http://www-106.ibm.com/developerworks/linux/library/l-apache/
Python&XML column: Tour of Python/XML - http://www.xml.com/pub/a/2002/09/18/py.html
Python/Web Services column: xmlrpclib - http://www-106.ibm.com/developerworks/webservices/library/ws-pyth10.html