PEP 393 vs UTF-8 Everywhere

Sun Jan 22 09:01:32 EST 2017

On Sun, 22 Jan 2017 07:34 pm, Marko Rauhamaa wrote:

> Steve D'Aprano <steve+python at pearwood.info>:
> 
>> On Sun, 22 Jan 2017 06:52 am, Marko Rauhamaa wrote:
>>> Also, [surrogates] don't exist as Unicode code points. Python
>>> shouldn't allow surrogate characters in strings.
>>
>> Not quite. This is where it gets a bit messy and confusing. The bottom
>> line is: surrogates *are* code points, but they aren't *characters*.
> 
> All animals are equal, but some animals are more equal than others.

Huh?

>> Strings which contain surrogates are strictly speaking illegal,
>> although some programming languages (including Python) allow them.
> 
> Python shouldn't allow them.

That's one opinion.

>> The Unicode standard defines surrogates as follows:
>> [...]
>>
>> - Surrogate Code Point. A Unicode code point in the range
>>   U+D800..U+DFFF. Reserved for use by UTF-16,
> 
> The writer of the standard is playing word games, maybe to offer a fig
> leaf to Windows, Java et al.

Seriously?

>> By the letter of the Unicode standard, [Python] should not do this,
>> but nevertheless it does and it appears to do no real harm and have
>> some benefit.
> 
> I'm afraid Python's choice may lead to exploitable security holes in
> Python programs.

Feel free to back up that with an actual demonstration of an exploit, rather
than just FUD.

>>>> py> low = '\uDC37'
>>> 
>>> That should raise a SyntaxError exception.
>>
>> If Python was strictly conforming, that is correct, but it turns out
>> there are some useful things you can do with strings if you allow
>> surrogates.
> 
> Conceptual confusion is a high price to pay for such tricks.

There's a lot to comprehend about Unicode. I don't see that Python's
non-strict implementation is harder to understand than the strict version.

-- 
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.