[Tutor] how are unicode chars represented?
Mark Tolonen
metolone+gmane at gmail.com
Tue Mar 31 07:52:15 CEST 2009
"Kent Johnson" <kent37 at tds.net> wrote in message
news:1c2a2c590903300352t2bd3f1a7j5f37703cf1c3b0c at mail.gmail.com...
> On Mon, Mar 30, 2009 at 3:36 AM, spir <denis.spir at free.fr> wrote:
>> Everything is in the title ;-)
>> (Is it kind of integers representing the code point?)
>
> Unicode is represented as 16-bit integers. I'm not sure, but I don't
> think Python has support for surrogate pairs, i.e. characters outside
> the BMP.
Unicode is simply code points. How the code points are represented
internally is another matter. The code below is from a 16-bit (narrow)
Unicode build of Python; it should look exactly the same on a 32-bit (wide)
Unicode build, even though the internal representation differs.
Python 2.6.1 (r261:67517, Dec 4 2008, 16:51:00) [MSC v.1500 32 bit (Intel)]
on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> x=u'\U00012345'
>>> x.encode('utf8')
'\xf0\x92\x8d\x85'
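Those four bytes can be derived by hand from the code point using the
standard four-byte UTF-8 pattern. A small sketch (written in Python 3
syntax for the bytes literal; the session above used Python 2 str output):

```python
# Manually UTF-8-encode the supplementary code point U+12345
# and check it against the built-in encoder.
cp = 0x12345

# Four-byte UTF-8 pattern: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
manual = bytes([
    0xF0 | (cp >> 18),           # leading byte: top 3 bits
    0x80 | ((cp >> 12) & 0x3F),  # continuation: next 6 bits
    0x80 | ((cp >> 6) & 0x3F),   # continuation: next 6 bits
    0x80 | (cp & 0x3F),          # continuation: low 6 bits
])

assert manual == '\U00012345'.encode('utf-8') == b'\xf0\x92\x8d\x85'
```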
However, I wonder if this should be considered a bug. I would think the
length of a Unicode string should be the number of code points in the
string, which for the string above is 1. Does anyone have a 32-bit Unicode
build of Python handy? The output below exposes the implementation as UTF-16.
>>> len(x)
2
>>> x[0]
u'\ud808'
>>> x[1]
u'\udf45'
>>>
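The two surrogates shown above follow directly from the UTF-16 encoding
rules, and you can tell a narrow build from a wide one by inspecting
sys.maxunicode. A sketch of the arithmetic (Python 3 syntax, though the
formulas themselves are build-independent):

```python
import sys

# On a narrow (16-bit) Python 2 build sys.maxunicode is 0xFFFF;
# on a wide (32-bit) build it is 0x10FFFF.
print(hex(sys.maxunicode))

# UTF-16 surrogate-pair arithmetic for U+12345
cp = 0x12345
offset = cp - 0x10000            # 20-bit offset into the supplementary planes
high = 0xD800 + (offset >> 10)   # high (lead) surrogate: top 10 bits
low = 0xDC00 + (offset & 0x3FF)  # low (trail) surrogate: bottom 10 bits
assert (high, low) == (0xD808, 0xDF45)  # matches x[0] and x[1] above

# And decoding back to the code point:
assert 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00) == cp
```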
-Mark