PEP 393 vs UTF-8 Everywhere
Marko Rauhamaa
marko at pacujo.net
Sun Jan 22 03:34:07 EST 2017
Steve D'Aprano <steve+python at pearwood.info>:
> On Sun, 22 Jan 2017 06:52 am, Marko Rauhamaa wrote:
>> Also, [surrogates] don't exist as Unicode code points. Python
>> shouldn't allow surrogate characters in strings.
>
> Not quite. This is where it gets a bit messy and confusing. The bottom
> line is: surrogates *are* code points, but they aren't *characters*.
All animals are equal, but some animals are more equal than others.
> Strings which contain surrogates are strictly speaking illegal,
> although some programming languages (including Python) allow them.
Python shouldn't allow them.
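
For reference, a minimal sketch of the behaviour under discussion, assuming CPython 3. The helper name is_utf8_encodable is made up for illustration and is not part of any library. It shows that the str type accepts a lone surrogate, while the strict UTF-8 codec refuses to encode it:

```python
import unicodedata

low = '\udc37'  # lone low surrogate; CPython 3 accepts this literal

# The string holds one code point, in the surrogate range, category "Cs".
assert len(low) == 1
assert 0xD800 <= ord(low) <= 0xDFFF
assert unicodedata.category(low) == 'Cs'


def is_utf8_encodable(s):
    """Hypothetical helper: True if s survives strict UTF-8 encoding."""
    try:
        s.encode('utf-8')
    except UnicodeEncodeError:
        # The strict error handler rejects surrogate code points.
        return False
    return True
```

So the failure shows up only when such a string is encoded, not when it is created.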
> The Unicode standard defines surrogates as follows:
> [...]
>
> - Surrogate Code Point. A Unicode code point in the range
> U+D800..U+DFFF. Reserved for use by UTF-16,
The writer of the standard is playing word games, maybe to offer a fig
leaf to Windows, Java et al.
> By the letter of the Unicode standard, [Python] should not do this,
> but nevertheless it does and it appears to do no real harm and have
> some benefit.
I'm afraid Python's choice may lead to exploitable security holes in
Python programs.
>>> py> low = '\uDC37'
>>
>> That should raise a SyntaxError exception.
>
> If Python was strictly conforming, that is correct, but it turns out
> there are some useful things you can do with strings if you allow
> surrogates.
Conceptual confusion is a high price to pay for such tricks.
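
One such trick is presumably the "surrogateescape" error handler from PEP 383; that this is the use meant above is an assumption. A short sketch, with made-up variable names, of how it smuggles undecodable bytes through a str as lone surrogates:

```python
raw = b'caf\xff'  # not valid UTF-8: the byte 0xFF never occurs in UTF-8

# Each undecodable byte 0xNN is mapped to the lone surrogate U+DCNN.
text = raw.decode('utf-8', errors='surrogateescape')
assert text == 'caf\udcff'

# Encoding with the same handler restores the original bytes exactly.
roundtrip = text.encode('utf-8', errors='surrogateescape')
assert roundtrip == raw
```

The round trip is lossless, but the intermediate str contains a surrogate code point, which is exactly the kind of string being argued about here.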
Marko