PEP 393 vs UTF-8 Everywhere
Steve D'Aprano
steve+python at pearwood.info
Sun Jan 22 09:01:32 EST 2017
On Sun, 22 Jan 2017 07:34 pm, Marko Rauhamaa wrote:
> Steve D'Aprano <steve+python at pearwood.info>:
>
>> On Sun, 22 Jan 2017 06:52 am, Marko Rauhamaa wrote:
>>> Also, [surrogates] don't exist as Unicode code points. Python
>>> shouldn't allow surrogate characters in strings.
>>
>> Not quite. This is where it gets a bit messy and confusing. The bottom
>> line is: surrogates *are* code points, but they aren't *characters*.
>
> All animals are equal, but some animals are more equal than others.
Huh?
>> Strings which contain surrogates are strictly speaking illegal,
>> although some programming languages (including Python) allow them.
>
> Python shouldn't allow them.
That's one opinion.
>> The Unicode standard defines surrogates as follows:
>> [...]
>>
>> - Surrogate Code Point. A Unicode code point in the range
>> U+D800..U+DFFF. Reserved for use by UTF-16,
>
> The writer of the standard is playing word games, maybe to offer a fig
> leaf to Windows, Java et al.
Seriously?
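
For anyone who wants to check the definition quoted above rather than argue about it, here is a minimal sketch. The helper name is_surrogate is made up for illustration; unicodedata.category is the standard library call, and it reports the general category "Cs" (Other, surrogate) for these code points.

```python
import unicodedata

def is_surrogate(ch):
    """Hypothetical helper: True if the single code point ch is in U+D800..U+DFFF."""
    return 0xD800 <= ord(ch) <= 0xDFFF

low = '\uDC37'

# The surrogate has a code point number like anything else in a str...
code_point = ord(low)

# ...but its general category is Cs (Other, surrogate), not a letter,
# digit, punctuation or symbol category.
category = unicodedata.category(low)

print(hex(code_point), category, is_surrogate(low), is_surrogate('a'))
```

So the interpreter treats U+DC37 as a code point it can store and index, while the Unicode database classifies it as a surrogate rather than as any kind of character.
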
>> By the letter of the Unicode standard, [Python] should not do this,
>> but nevertheless it does and it appears to do no real harm and have
>> some benefit.
>
> I'm afraid Python's choice may lead to exploitable security holes in
> Python programs.
Feel free to back that up with an actual demonstration of an exploit, rather
than just FUD.
>>>> py> low = '\uDC37'
>>>
>>> That should raise a SyntaxError exception.
>>
>> If Python were strictly conforming, that is correct, but it turns out
>> there are some useful things you can do with strings if you allow
>> surrogates.
>
> Conceptual confusion is a high price to pay for such tricks.
There's a lot to comprehend about Unicode. I don't see that Python's
non-strict implementation is harder to understand than the strict version.
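
As a rough illustration of what the non-strict behaviour buys you, here is a sketch using the surrogateescape error handler from PEP 383. That handler is my choice of example and is an assumption about the sort of "useful things" meant above; the byte string is made up for the demonstration.

```python
# A str may hold a lone surrogate; it counts as one code point.
low = '\uDC37'
assert len(low) == 1

# The strict UTF-8 codec still refuses to encode it, so a lone surrogate
# does not silently leak into encoded output.
try:
    low.encode('utf-8')
    strict_encode_ok = True
except UnicodeEncodeError:
    strict_encode_ok = False

# surrogateescape smuggles undecodable bytes through a str as lone
# surrogates in U+DC80..U+DCFF, and turns them back into the original
# bytes on encoding.
raw = b'caf\xe9'  # Latin-1 bytes, not valid UTF-8
text = raw.decode('utf-8', errors='surrogateescape')
round_trip = text.encode('utf-8', errors='surrogateescape')

print(strict_encode_ok, ascii(text), round_trip == raw)
```

The byte 0xE9 comes back as the lone surrogate U+DCE9, and the round trip is lossless. That only works because str is allowed to hold surrogates in the first place.
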
--
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.