PEP 393 vs UTF-8 Everywhere
steve+python at pearwood.info
Sun Jan 22 09:01:32 EST 2017
On Sun, 22 Jan 2017 07:34 pm, Marko Rauhamaa wrote:
> Steve D'Aprano <steve+python at pearwood.info>:
>> On Sun, 22 Jan 2017 06:52 am, Marko Rauhamaa wrote:
>>> Also, [surrogates] don't exist as Unicode code points. Python
>>> shouldn't allow surrogate characters in strings.
>> Not quite. This is where it gets a bit messy and confusing. The bottom
>> line is: surrogates *are* code points, but they aren't *characters*.
> All animals are equal, but some animals are more equal than others.
>> Strings which contain surrogates are strictly speaking illegal,
>> although some programming languages (including Python) allow them.
> Python shouldn't allow them.
That's one opinion.
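For anyone following along, here is a minimal sketch of what "allowing" a surrogate looks like in practice, assuming CPython 3. The variable names are just for illustration. The string is accepted and reports a code point with general category Cs, but the strict UTF-8 encoder refuses to write it out:

```python
import unicodedata

# A lone low surrogate. CPython accepts this as a one-element str.
low = '\uDC37'

# It behaves like a code point: it has a length and a general category.
length = len(low)
category = unicodedata.category(low)  # 'Cs' means "Other, surrogate"

# It is not an encodable character: the strict UTF-8 codec rejects it.
try:
    low.encode('utf-8')
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

print(length, category, strict_ok)
```

So the surrogate can live inside a str, but it cannot leave the program as UTF-8 unless you explicitly ask for a non-strict error handler.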
>> The Unicode standard defines surrogates as follows:
>> - Surrogate Code Point. A Unicode code point in the range
>> U+D800..U+DFFF. Reserved for use by UTF-16,
> The writer of the standard is playing word games, maybe to offer a fig
> leaf to Windows, Java et al.
>> By the letter of the Unicode standard, [Python] should not do this,
>> but nevertheless it does and it appears to do no real harm and have
>> some benefit.
> I'm afraid Python's choice may lead to exploitable security holes in
> Python programs.
Feel free to back that up with an actual demonstration of an exploit, rather
than just FUD.
>>>> py> low = '\uDC37'
>>> That should raise a SyntaxError exception.
>> If Python was strictly conforming, that is correct, but it turns out
>> there are some useful things you can do with strings if you allow
> Conceptual confusion is a high price to pay for such tricks.
There's a lot to comprehend about Unicode. I don't see that Python's
non-strict implementation is harder to understand than the strict version.
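As one concrete illustration of the sort of trick in question (my choice of example, not an exhaustive list), the surrogateescape error handler from PEP 383 maps undecodable bytes to lone surrogates so that they survive a round trip. A rough sketch, with made-up sample bytes:

```python
# Latin-1 encoded bytes; the final 0xE9 is not valid UTF-8 on its own.
raw = b'caf\xe9'

# Decode as UTF-8, smuggling the bad byte through as U+DCE9.
text = raw.decode('utf-8', errors='surrogateescape')
smuggled = text[-1]

# Encoding with the same handler restores the original bytes exactly.
round_trip = raw_again = text.encode('utf-8', errors='surrogateescape')

print(ascii(text), round_trip == raw)
```

This only works because str is permitted to hold a lone surrogate in the first place. A strictly conforming string type would need some other escape mechanism for the same job.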
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.