Variables versus name bindings [Re: A certain part of an if() structure never gets executed.]

Steven D'Aprano steve+comp.lang.python at pearwood.info
Tue Jun 18 02:04:36 EDT 2013


On Tue, 18 Jun 2013 00:12:34 -0400, Dave Angel wrote:

> On 06/17/2013 10:42 PM, Steven D'Aprano wrote:
>> On Mon, 17 Jun 2013 21:06:57 -0400, Dave Angel wrote:
>>
>>> On 06/17/2013 08:41 PM, Steven D'Aprano wrote:
>>>>
>>>>      <SNIP>
>>>>
>>>> In Python 3.2 and older, the data will be either UTF-4 or UTF-8,
>>>> selected when the Python compiler itself is compiled.
>>>
>>> I think that was a typo.  Do you perhaps mean UCS-2 or UCS-4?
>>
>> Yes, that would be better.
>>
>> UCS-2 is identical to UTF-16, except it doesn't support non-BMP
>> characters and therefore doesn't have surrogate pairs.
>>
>> UCS-4 is functionally equivalent to UTF-16,
> 
> Perhaps you mean UTF-32 ?


Yes, sorry for the repeated confusion.


>>   as far as I can tell. (I'm
>> not really sure what the difference is.)
>>
>>
> Now you've got me curious, by bringing up surrogate pairs.  Do you know
> whether a narrow build (say 3.2) really works as UTF16, so when you
> encode a surrogate pair (4 bytes) to UTF-8, it encodes a single Unicode
> character into a single UTF-8 sequence (prob.  4 bytes long) ?

In a Python narrow build, the internal storage of strings is equivalent 
to UTF-16: all characters in the Basic Multilingual Plane require two 
bytes:

py> sys.maxunicode
65535
py> sys.getsizeof('π') - sys.getsizeof('')
2

Outside of the BMP, characters are treated as a pair of surrogates:

py> c = chr(0x10F000)  # one character...
py> len(c)  # ...stored as a pair of surrogates
2
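The split into surrogates is plain arithmetic. Here is a minimal sketch of it, runnable on any Python 3; the helper name to_surrogates is my own, not anything in the standard library, and the result is checked against the UTF-16 codec:

```python
def to_surrogates(code_point):
    """Split a non-BMP code point into (high, low) UTF-16 surrogates.

    Hypothetical helper for illustration only.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000          # 20 significant bits
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


high, low = to_surrogates(0x10F000)
print(hex(high), hex(low))

# The UTF-16 codec should produce the same two 16-bit units.
utf16 = chr(0x10F000).encode('utf-16-be')
print(utf16 == high.to_bytes(2, 'big') + low.to_bytes(2, 'big'))
```

The high surrogate it computes for 0x10F000 is 0xdbfc, the same value that shows up in the error message further down.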

Encoding and decoding work fine:

py> c.encode('utf-8').decode('utf-8') == c
True
py> c.encode('utf-8')
b'\xf4\x8f\x80\x80'


The problem with surrogates is that it is possible to accidentally 
separate the pair, which leads to broken, invalid text:

py> c[0].encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udbfc' in position 0: surrogates not allowed


(The error message is a little misleading; on a narrow build surrogates 
are allowed, but only if they make up a valid pair, which is then encoded 
as the single character it represents.)
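You can reproduce the lone-surrogate failure without a narrow build by writing the surrogate as an escape. This is only a sketch; the variable names are mine, and the 'surrogatepass' error handler is used purely to show what the codec is refusing to do by default:

```python
lone = '\udbfc'  # a high surrogate with no partner

try:
    lone.encode('utf-8')
except UnicodeEncodeError as err:
    reason = err.reason
else:
    reason = None
print(reason)

# 'surrogatepass' forces the lone surrogate through as three bytes,
# which a strict UTF-8 decoder will reject.
raw = lone.encode('utf-8', 'surrogatepass')
print(raw)

# A valid high/low pair, passed through UTF-16, comes back as the
# single character it represents.
pair = '\udbfc\udc00'
recombined = pair.encode('utf-16-le', 'surrogatepass').decode('utf-16-le')
print(recombined == chr(0x10F000))
```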


Python's handling of UTF-16 is, as far as I know, correct. What isn't 
correct is that the high-level Python string methods assume that two 
bytes == one character, so indexing and slicing can separate a surrogate 
pair, which gives you junk text. Wide builds don't have this problem, 
because every character == four bytes, and neither does Python 3.3, 
which chooses the storage width separately for each string.
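For comparison, here is a quick check you can run on Python 3.3 or later, where there is no narrow/wide distinction any more. It is a sketch with made-up variable names, and it assumes nothing beyond the sys module and the built-in str type:

```python
import sys

print(sys.maxunicode == 0x10FFFF)   # full Unicode range on every build

c = chr(0x10F000)
print(len(c))                       # one character, not a surrogate pair

# Indexing walks over whole characters, so nothing can be split.
text = 'a' + c + 'b'
pieces = [text[i] for i in range(len(text))]
roundtrip = all(ch.encode('utf-8').decode('utf-8') == ch for ch in pieces)
print(len(pieces), roundtrip)
```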



-- 
Steven


