PEP 393 vs UTF-8 Everywhere
steve+python at pearwood.info
Sun Jan 22 21:14:20 EST 2017
On Mon, 23 Jan 2017 02:19 am, Marko Rauhamaa wrote:
> Steve D'Aprano <steve+python at pearwood.info>:
>> On Sun, 22 Jan 2017 07:34 pm, Marko Rauhamaa wrote:
>>> Steve D'Aprano <steve+python at pearwood.info>:
>>>> On Sun, 22 Jan 2017 06:52 am, Marko Rauhamaa wrote:
>>>>> Also, [surrogates] don't exist as Unicode code points. Python
>>>>> shouldn't allow surrogate characters in strings.
>>>> Not quite. This is where it gets a bit messy and confusing. The
>>>> bottom line is: surrogates *are* code points, but they aren't
>>>> characters.
>>> All animals are equal, but some animals are more equal than others.
> There is no difference between 0xD800 and 0xD8000000. They are both
> numbers that don't--and won't--represent anything in Unicode.

py> 0xD800 == 0xD8000000
False
Your use of hex notation 0x... indicates that you're talking about code
units rather than U+... code points. The first one 0xD800 could be:
- a Little Endian double-byte code unit for 'Ø' in either UCS-2 or UTF-16;
- a Big Endian double-byte code unit that has no special meaning in UCS-2;
- one half of a surrogate pair (two double-byte code units) in Big Endian
UTF-16, encoding some unknown supplementary code point.
The second one 0xD8000000 could be:
- a 32-bit unsigned integer, 3623878656, which is out of range for Big Endian
UCS-4 or UTF-32;
- the Little Endian four-byte code unit for 'Ø' in either UCS-4 or UTF-32.
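You can watch both readings happen by unpacking the same bytes with each
byte order (just an illustration, using the struct module):

```python
import struct

# The two bytes D8 00, read both ways:
two = bytes([0xD8, 0x00])
le16 = struct.unpack("<H", two)[0]   # little-endian: 0x00D8, 'Ø'
be16 = struct.unpack(">H", two)[0]   # big-endian: 0xD800, a surrogate unit

# The four bytes D8 00 00 00, read both ways:
four = bytes([0xD8, 0x00, 0x00, 0x00])
le32 = struct.unpack("<I", four)[0]  # little-endian: 0x000000D8, 'Ø' again
be32 = struct.unpack(">I", four)[0]  # big-endian: 0xD8000000, past U+10FFFF

print(hex(le16), hex(be16), hex(le32), hex(be32))
```

Same bytes, four different meanings, depending entirely on which encoding
and byte order you assume.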
> It's pointless to call one a "code point" and not the other one.
Neither of them is a code point. You're confusing the concrete
representation with the abstract character.
Perhaps you meant to compare the code point U+D800 to, well, there's no
comparison to be made, because "U+D8000000" is not valid and is completely
out of range. The largest code point is U+10FFFF.
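You can check this at the interactive prompt: chr() accepts anything up to
and including U+10FFFF, surrogates included, and rejects everything past it:

```python
# The largest code point, U+10FFFF, is valid (though unassigned):
top = chr(0x10FFFF)

# A surrogate such as U+D800 is also a code point, so chr() accepts it:
lone = chr(0xD800)

# But nothing above U+10FFFF is a code point at all:
try:
    chr(0x110000)
    accepted = True
except ValueError:
    accepted = False

print(ord(top), ord(lone), accepted)
```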
> A code point
> that isn't code for anything can barely be called a code point.
It does have a purpose. Or even more than one.
- It ensures that there is a one-to-one mapping between code points and
code units in any specific encoding and byte-order.
- By reserving those code points, it ensures that they cannot be
accidentally used by the standard for something else.
- It makes it easier to talk about the entities: "U+D800 is a surrogate
code point reserved for UTF-16 surrogates", as opposed to "U+D800 isn't
anything, but if it was something, it would be a code point reserved
for UTF-16 surrogates".
- Or worse, forcing us to talk in terms of code units (implementation)
instead of abstract characters, which is painfully verbose:
"0xD800 in Big Endian UTF-16, or 0x00D8 in Little Endian UTF-16, or
0x0000D800 in Big Endian UTF-32, or 0x00D80000 in Little Endian
UTF-32, doesn't map to any code point but is reserved for UTF-16
surrogates".
And, an entirely unforeseen purpose:
- It allows languages like Python to (ab)use surrogate code points for
round-tripping file names which aren't valid Unicode.
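That's the surrogateescape error handler (PEP 383). A rough sketch of the
round trip, using an explicit decode rather than the filesystem codec:

```python
# A byte sequence that is not valid UTF-8 (e.g. a Latin-1 filename):
raw = b"caf\xe9"

# surrogateescape smuggles the bad byte through as lone surrogate U+DCE9...
name = raw.decode("utf-8", errors="surrogateescape")

# ...and the same handler restores the original bytes exactly:
assert name.encode("utf-8", errors="surrogateescape") == raw

# A strict re-encode fails, because lone surrogates aren't encodable:
try:
    name.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
```

os.fsdecode/os.fsencode do the same dance with the filesystem encoding,
which is how os.listdir() can hand you str pathnames it could never have
decoded strictly.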
>>> I'm afraid Python's choice may lead to exploitable security holes in
>>> Python programs.
>> Feel free to back up that with an actual demonstration of an exploit,
>> rather than just FUD.
> It might come as a surprise to programmers that pathnames cannot be
> UTF-encoded or displayed.
Many things come as surprises to programmers, and many pathnames cannot be
UTF-encoded or displayed.
To be precise, Mac OS requires pathnames to be both valid and normalised
UTF-8, and it would be nice if that practice spread. But Windows only
requires pathnames to consist of 16-bit UCS-2 code units, and Linux pathnames are
arbitrary bytes that may include characters which are illegal on Windows.
So you don't need to involve surrogates to have undecodable pathnames.
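For instance, the byte 0xFF can never occur in well-formed UTF-8, yet it is
a perfectly legal byte in a Linux pathname (a contrived example):

```python
# A legal Linux pathname, as raw bytes, that no strict UTF-8 decode accepts:
path = b"/tmp/\xff\xfe-report.txt"
try:
    path.decode("utf-8")
    decodable = True
except UnicodeDecodeError:
    decodable = False

print(decodable)
```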
> Also, those situations might not show up
> during testing but only with appropriately crafted input.
I'm not seeing a security exploit here.
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.