PEP 393 vs UTF-8 Everywhere
Marko Rauhamaa
marko at pacujo.net
Sun Jan 22 10:19:34 EST 2017
Steve D'Aprano <steve+python at pearwood.info>:
> On Sun, 22 Jan 2017 07:34 pm, Marko Rauhamaa wrote:
>
>> Steve D'Aprano <steve+python at pearwood.info>:
>>
>>> On Sun, 22 Jan 2017 06:52 am, Marko Rauhamaa wrote:
>>>> Also, [surrogates] don't exist as Unicode code points. Python
>>>> shouldn't allow surrogate characters in strings.
>>>
>>> Not quite. This is where it gets a bit messy and confusing. The
>>> bottom line is: surrogates *are* code points, but they aren't
>>> *characters*.
>>
>> All animals are equal, but some animals are more equal than others.
>
> Huh?

There is no difference between 0xD800 and 0xD8000000. They are both
numbers that don't--and won't--represent anything in Unicode. It's
pointless to call one a "code point" and not the other. A code point
that isn't code for anything can barely be called a code point.

I'm guessing 0xD800 is called a code point because it always was called
that. The range was set aside when UTF-16 was invented, but they didn't
want to "demote" the number retroactively, especially since Windows and
Java were already allowing such values in strings.

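
For what it's worth, Python itself does treat the two numbers
differently. A minimal sketch, assuming CPython 3; the helper names
python_accepts and utf8_encodable are made up for illustration:

```python
import unicodedata


def python_accepts(n):
    """Return True if chr(n) succeeds, i.e. Python will store n in a str."""
    try:
        chr(n)
        return True
    except (ValueError, OverflowError):
        # Values above 0x10FFFF are rejected; the exact exception type
        # can differ between CPython versions.
        return False


def utf8_encodable(s):
    """Return True if s can be encoded with the strict UTF-8 codec."""
    try:
        s.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


lone = chr(0xD800)                    # a str holding one lone surrogate

print(python_accepts(0xD800))         # True
print(python_accepts(0xD8000000))     # False
print(unicodedata.category(lone))     # Cs (the "surrogate" category)
print(utf8_encodable(lone))           # False
```

So the surrogate gets into a str without complaint and only causes
trouble later, when somebody tries to encode it.
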
>>> By the letter of the Unicode standard, [Python] should not do this,
>>> but nevertheless it does and it appears to do no real harm and have
>>> some benefit.
>>
>> I'm afraid Python's choice may lead to exploitable security holes in
>> Python programs.
>
> Feel free to back up that with an actual demonstration of an exploit,
> rather than just FUD.

It might come as a surprise to programmers that some pathnames cannot
be encoded as UTF-8 or displayed (on POSIX, Python carries undecodable
filename bytes in the str as lone surrogates). Also, those situations
might not show up during testing but only with appropriately crafted
input.

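
A minimal sketch of the kind of situation I have in mind, assuming a
POSIX system with a UTF-8 locale, where os.listdir() and os.fsdecode()
decode filenames with the "surrogateescape" error handler (PEP 383). The
filename and the helper utf8_encodable are made up for illustration, and
the decode is spelled out explicitly so the result does not depend on
the platform:

```python
def utf8_encodable(s):
    """Return True if s can be encoded with the strict UTF-8 codec."""
    try:
        s.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


# Hypothetical filename as raw bytes; 0xFF never occurs in valid UTF-8.
raw = b"report-\xff.txt"

# The undecodable byte is smuggled into the str as a lone surrogate.
name = raw.decode("utf-8", "surrogateescape")

print(ascii(name))             # 'report-\udcff.txt'
print(utf8_encodable(name))    # False

# Only the same error handler turns the str back into the original bytes.
roundtrip = name.encode("utf-8", "surrogateescape")
print(roundtrip == raw)        # True
```

A program that logs such a name, sends it over a socket, or stores it
as JSON with a plain .encode("utf-8") gets a UnicodeEncodeError, and
only for input that somebody has crafted to contain the bad byte.
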
Marko