[Tutor] how are unicode chars represented?

Wed Apr 1 03:08:55 CEST 2009

"Kent Johnson" <kent37 at tds.net> wrote in message 
news:1c2a2c590903310357m682e16acr9d94b12b609939d5 at mail.gmail.com...
On Tue, Mar 31, 2009 at 1:52 AM, Mark Tolonen <metolone+gmane at gmail.com> 
wrote:

>> Unicode is simply code points. How the code points are represented
>> internally is another matter. The code below is from a 16-bit Unicode
>> build of Python but should look exactly the same on a 32-bit Unicode
>> build; however, the internal representation is different.
>>
>> Python 2.6.1 (r261:67517, Dec 4 2008, 16:51:00) [MSC v.1500 32 bit (Intel)] on win32
>> Type "help", "copyright", "credits" or "license" for more information.
>> >>> x=u'\U00012345'
>> >>> x.encode('utf8')
>> '\xf0\x92\x8d\x85'
>>
>> However, I wonder if this should be considered a bug. I would think the
>> length of a Unicode string should be the number of code points in the
>> string, which for my string above should be 1. Anyone have a 32-bit
>> Unicode build of Python handy? This exposes the implementation as UTF-16.
>> >>> len(x)
>> 2
>> >>> x[0]
>> u'\ud808'
>> >>> x[1]
>> u'\udf45'
>
> In standard Python the representation of unicode is 16 bits, without
> correct handling of surrogate pairs (which is what your string
> contains). I think this is called UCS-2, not UTF-16.
>
> There is a compile switch to enable 32-bit representation of
> unicode. See PEP 261 and the "Internal Representation" section of the
> second link below for more details.
> http://www.python.org/dev/peps/pep-0261/
> http://www.cmlenz.net/archives/2008/07/the-truth-about-unicode-in-python
>
> Kent

My string above is UTF-16 because it *does* handle surrogate pairs.  See 
http://en.wikipedia.org/wiki/UTF-16.  "UCS-2 (2-byte Universal Character 
Set) is an obsolete character encoding which is a predecessor to UTF-16. The 
UCS-2 encoding form is identical to that of UTF-16, except that it *does 
not* support surrogate pairs...".  The single character \U00012345 was 
stored by Python as the surrogate pair \ud808\udf45 and was correctly 
encoded as the 4-byte UTF-8 '\xf0\x92\x8d\x85' in my example.  Also, 
"Because of the technical similarities and upwards compatibility from UCS-2 
to UTF-16, the two encodings are often erroneously conflated and used as if 
interchangeable, so that strings encoded in UTF-16 are sometimes 
misidentified as being encoded in UCS-2."  Python isn't strictly UCS-2 
anymore, but it doesn't completely implement UTF-16 either, since string 
operations such as len() and indexing count 16-bit code units rather than 
code points, and so give incorrect results for characters outside the BMP.
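
For illustration, here is a rough sketch of the surrogate arithmetic behind
the example above, and of how a length could be computed in code points
rather than 16-bit code units. It is written for Python 3, not the 2.6 build
shown earlier, and the helper names to_surrogate_pair and count_code_points
are hypothetical, not part of Python.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits -> low surrogate
    return high, low


def count_code_points(code_units):
    """Count code points in a sequence of UTF-16 code units.

    A high surrogate immediately followed by a low surrogate counts once;
    an unpaired surrogate is counted as one code point on its own.
    """
    count = 0
    i = 0
    while i < len(code_units):
        unit = code_units[i]
        is_pair = (0xD800 <= unit <= 0xDBFF
                   and i + 1 < len(code_units)
                   and 0xDC00 <= code_units[i + 1] <= 0xDFFF)
        i += 2 if is_pair else 1
        count += 1
    return count


high, low = to_surrogate_pair(0x12345)
print(hex(high), hex(low))               # 0xd808 0xdf45
print(count_code_points([high, low]))    # 1
```

The pair matches the x[0] and x[1] values in the session above, and counting
pairs once gives the length of 1 that the original question expected.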

-Mark