[Tutor] how are unicode chars represented?

Mark Tolonen metolone+gmane at gmail.com
Tue Mar 31 07:52:15 CEST 2009


"Kent Johnson" <kent37 at tds.net> wrote in message 
news:1c2a2c590903300352t2bd3f1a7j5f37703cf1c3b0c at mail.gmail.com...
> On Mon, Mar 30, 2009 at 3:36 AM, spir <denis.spir at free.fr> wrote:
>> Everything is in the title ;-)
>> (Is it kind of integers representing the code point?)
>
> Unicode is represented as 16-bit integers. I'm not sure, but I don't
> think Python has support for surrogate pairs, i.e. characters outside
> the BMP.

Unicode is simply code points.  How the code points are represented 
internally is another matter.  The code below is from a 16-bit (narrow) 
Unicode build of Python, but it should look exactly the same on a 32-bit 
(wide) build; only the internal representation differs.

Python 2.6.1 (r261:67517, Dec  4 2008, 16:51:00) [MSC v.1500 32 bit (Intel)] 
on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> x=u'\U00012345'
>>> x.encode('utf8')
'\xf0\x92\x8d\x85'
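
For reference, a narrow build can be told apart from a wide build at run 
time by checking sys.maxunicode.  A minimal sketch (the printed labels are 
just my own wording):

    import sys

    # A narrow (16-bit, UTF-16) build reports sys.maxunicode == 0xFFFF;
    # a wide (32-bit, UCS-4) build reports 0x10FFFF.
    if sys.maxunicode == 0xFFFF:
        print "narrow (16-bit) Unicode build"
    else:
        print "wide (32-bit) Unicode build"

    # The UTF-8 encoding of the string is the same on either build.
    x = u'\U00012345'
    print repr(x.encode('utf8'))   # '\xf0\x92\x8d\x85'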

However, I wonder whether this should be considered a bug.  I would think 
the length of a Unicode string should be the number of code points it 
contains, which for the string above would be 1.  Instead, the results 
below expose the internal representation as UTF-16.  Does anyone have a 
32-bit Unicode build of Python handy to compare?
>>> len(x)
2
>>> x[0]
u'\ud808'
>>> x[1]
u'\udf45'
>>>
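
In the meantime, one possible workaround for counting code points on a 
narrow build is to skip the low-surrogate half of each pair.  The helper 
codepoint_len below is just my own hypothetical sketch, not anything from 
the standard library:

    def codepoint_len(u):
        # Count code points rather than storage units: a non-BMP character
        # stored as a surrogate pair contributes only one to the total.
        # On a wide build every character is a single unit, so this is
        # simply equal to len(u).
        return sum(1 for ch in u if not 0xDC00 <= ord(ch) <= 0xDFFF)

    print codepoint_len(u'\U00012345')   # 1 on either build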

-Mark



