Variables versus name bindings [Re: A certainl part of an if() structure never gets executed.]
Steven D'Aprano
steve+comp.lang.python at pearwood.info
Tue Jun 18 02:04:36 EDT 2013
On Tue, 18 Jun 2013 00:12:34 -0400, Dave Angel wrote:
> On 06/17/2013 10:42 PM, Steven D'Aprano wrote:
>> On Mon, 17 Jun 2013 21:06:57 -0400, Dave Angel wrote:
>>
>>> On 06/17/2013 08:41 PM, Steven D'Aprano wrote:
>>>>
>>>> <SNIP>
>>>>
>>>> In Python 3.2 and older, the data will be either UTF-4 or UTF-8,
>>>> selected when the Python compiler itself is compiled.
>>>
>>> I think that was a typo. Do you perhaps mean UCS-2 or UCS-4?
>>
>> Yes, that would be better.
>>
>> UCS-2 is identical to UTF-16, except it doesn't support non-BMP
>> characters and therefore doesn't have surrogate pairs.
>>
>> UCS-4 is functionally equivalent to UTF-16,
>
> Perhaps you mean UTF-32 ?
Yes, sorry for the repeated confusion.
>> as far as I can tell. (I'm
>> not really sure what the difference is.)
>>
>>
> Now you've got me curious, by bringing up surrogate pairs. Do you know
> whether a narrow build (say 3.2) really works as UTF16, so when you
> encode a surrogate pair (4 bytes) to UTF-8, it encodes a single Unicode
> character into a single UTF-8 sequence (prob. 4 bytes long) ?
In a Python narrow build, the internal storage of strings is equivalent
to UTF-16: all characters in the Basic Multilingual Plane require two
bytes:
py> import sys
py> sys.maxunicode
65535
py> sys.getsizeof('π') - sys.getsizeof('')
2
Outside of the BMP, characters are treated as a pair of surrogates:
py> c = chr(0x10F000) # one character...
py> len(c) # ...stored as a pair of surrogates
2
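As a rough sketch of where those two surrogates come from, the UTF-16 arithmetic can be written out by hand. The helper name to_surrogates is made up for this illustration and is not part of Python; it just splits the 20-bit offset above the BMP into two 10-bit halves.

```python
def to_surrogates(code_point):
    """Split a non-BMP code point into its UTF-16 (high, low) surrogates.

    Hypothetical helper for illustration only.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above the BMP need surrogates")
    offset = code_point - 0x10000       # 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits go in the high surrogate
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits go in the low surrogate
    return high, low


high, low = to_surrogates(0x10F000)
print(hex(high), hex(low))  # 0xdbfc 0xdc00
```

The high surrogate 0xdbfc is the same '\udbfc' that turns up when c[0] is taken on a narrow build.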
Encoding and decoding works fine:
py> c.encode('utf-8').decode('utf-8') == c
True
py> c.encode('utf-8')
b'\xf4\x8f\x80\x80'
The problem with surrogates is that it is possible to accidentally
separate the pair, which leads to broken, invalid text:
py> c[0].encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udbfc' in position 0: surrogates not allowed
(The error message is a little misleading; the problem is not surrogates
as such but a lone surrogate. A valid pair encodes fine, as the single
code point it represents.)
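The same failure can be reproduced on a current Python (3.3 or later) by writing the lone surrogate as an escape. This is only a sketch: the variable names are made up for the illustration, and 'surrogatepass' is the standard codec error handler that forces the surrogate through.

```python
lone = '\udbfc'  # a high surrogate with no low surrogate after it

# The strict UTF-8 codec refuses a lone surrogate.
try:
    lone.encode('utf-8')
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# The 'surrogatepass' error handler encodes it anyway, as three bytes
# that are not valid UTF-8 for interchange.
passed = lone.encode('utf-8', 'surrogatepass')

# The full character encodes to one four-byte UTF-8 sequence.
whole = chr(0x10F000).encode('utf-8')

print(strict_ok, passed, whole)
```

The bytes produced by 'surrogatepass' only round-trip if they are decoded with the same error handler.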
Python's handling of UTF-16 is, as far as I know, correct. What isn't
correct is that the high-level Python string methods assume that two
bytes == one character, which can lead to surrogates being separated,
which gives you junk text. Wide builds don't have this problem, because
every character == four bytes, and neither does Python 3.3 or later.
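For code that has to cope with a narrow build, one workaround is to walk the string yourself and glue surrogate pairs back together instead of trusting len() and indexing. The generator below is a minimal sketch of that idea; code_points is a hypothetical name, not something the standard library provides.

```python
def code_points(s):
    """Yield the code points in s, joining any UTF-16 surrogate pairs.

    Hypothetical helper for illustration only. A lone surrogate is
    yielded unchanged.
    """
    i = 0
    while i < len(s):
        unit = ord(s[i])
        # A high surrogate followed by a low surrogate is one character.
        if 0xD800 <= unit <= 0xDBFF and i + 1 < len(s):
            nxt = ord(s[i + 1])
            if 0xDC00 <= nxt <= 0xDFFF:
                yield 0x10000 + ((unit - 0xD800) << 10) + (nxt - 0xDC00)
                i += 2
                continue
        yield unit
        i += 1


# Count characters rather than UTF-16 code units.
print(len(list(code_points('\udbfc\udc00'))))  # 1
```

On a wide build or on 3.3 and later this is unnecessary for ordinary strings, since chr(0x10F000) is already a single item.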
--
Steven