[Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

Cameron Simpson cs at zip.com.au
Wed Apr 29 05:27:40 CEST 2009


On 28Apr2009 13:37, Glenn Linderman <v+python at g.nevcal.com> wrote:
> On approximately 4/28/2009 1:25 PM, came the following characters from  
> the keyboard of Martin v. Löwis:
>>> The UTF-8b representation suffers from the same potential ambiguities as
>>> the PUA characters... 
>>
>> Not at all the same ambiguities. Here, again, the two choices:
>>
>> A. use PUA characters to represent undecodable bytes, in particular for
>>    UTF-8 (the PEP actually never proposed this to happen).
>>    This introduces an ambiguity: two different files in the same
>>    directory may decode to the same string name, if one has the PUA
>>    character, and the other has a non-decodable byte that gets decoded
>>    to the same PUA character.
>>
>> B. use UTF-8b, representing the byte with ill-formed surrogate codes.
>>    The same ambiguity does *NOT* exist. If a file on disk already
>>    contains an invalid surrogate code in its file name, then the UTF-8b
>>    decoder will recognize this as invalid, and decode it byte-for-byte,
>>    into three surrogate codes. Hence, the file names that are different
>>    on disk are also different in memory. No ambiguity.
>
> C. File on disk with the invalid surrogate code, accessed via the str  
> interface, no decoding happens, matches in memory the file on disk with  
> the byte that translates to the same surrogate, accessed via the bytes  
> interface.  Ambiguity.

Is this a Windows example, or (now I think on it) an equivalent POSIX example
of using the PEP where the locale encoding is UTF-16?

In either case, I would say one could make an argument for being stricter
when reading in OS-native sequences. Granted, NTFS doesn't prevent
half-surrogates in filenames, and POSIX won't either, because to
the OS they're just bytes. So on decoding, require well-formed data; when
you hit ill-formed data, treat the nasty half-surrogate as a PAIR of
bytes to be escaped in the resulting decode.

Ambiguity avoided.
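
To make the "escape anything ill-formed" rule concrete, here is a minimal
sketch of the UTF-8 case from option B above. It assumes Python 3.1 or
later, where the PEP's handler is available under the name
"surrogateescape"; the variable names are only illustrative. It shows
that a filename holding a raw undecodable byte and a filename holding the
UTF-8-style encoding of that same surrogate decode to different strings,
and that both round-trip. The UTF-16 variant I describe is not shown.

```python
# Assumption: Python 3.1+, where the PEP 383 handler is "surrogateescape".

# A byte that can never appear in well-formed UTF-8.
undecodable = b"\xff"

# The three bytes you would get by (illegally) encoding U+DCFF as UTF-8.
encoded_surrogate = b"\xed\xb3\xbf"

name_a = undecodable.decode("utf-8", "surrogateescape")
name_b = encoded_surrogate.decode("utf-8", "surrogateescape")

# The raw byte becomes one lone surrogate; the on-disk surrogate is
# rejected as ill-formed and escaped byte-for-byte into three of them.
print(ascii(name_a))  # '\udcff'
print(ascii(name_b))  # '\udced\udcb3\udcbf'

# Different on disk, different in memory.
distinct = name_a != name_b

# Encoding with the same handler recovers the original bytes.
roundtrip_ok = (
    name_a.encode("utf-8", "surrogateescape") == undecodable
    and name_b.encode("utf-8", "surrogateescape") == encoded_surrogate
)
print(distinct, roundtrip_ok)
```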

I'm more concerned with your (yours? someone else's?) mention of shift
characters. I'm unfamiliar with these encodings: to translate such a
thing into a Latin example, is it the case that there are schemes with
valid encodings that look like:

  [SHIFT] a b c

which would produce "ABC" in Unicode, making it ambiguous with:

  A B C

which would also produce "ABC"?
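
For what it's worth, this kind of thing can be tried with codecs that
ship with Python. The sketch below is only an illustration of the general
shape of the question, not of any particular encoding mentioned in the
thread: it uses UTF-7 (where "+" shifts into a base64 run and "-" shifts
back out) and ISO-2022-JP (where ESC ( B designates ASCII). Both codec
choices and all variable names are my own assumptions.

```python
# UTF-7: "+" shifts into a base64-encoded UTF-16 run, "-" ends it.
plain = b"A"
shifted = b"+AEE-"  # base64 for the UTF-16 code unit 0x0041, i.e. "A"

via_plain = plain.decode("utf-7")
via_shift = shifted.decode("utf-7")

# ISO-2022-JP: ESC ( B explicitly selects ASCII, which is already the
# initial state, so the escape sequence contributes no characters.
esc_ascii = b"\x1b(B"
iso_plain = b"abc".decode("iso2022_jp")
iso_shifted = (esc_ascii + b"abc").decode("iso2022_jp")

# Two distinct byte strings, one decoded string in each case.
print(via_plain == via_shift, plain != shifted)
print(iso_plain == iso_shifted)
```

If the answer to my question is yes, then decoding alone loses the
distinction between such names, and only the escaping of ill-formed
input is covered by the argument above.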

Cheers,
-- 
Cameron Simpson <cs at zip.com.au> DoD#743
http://www.cskk.ezoshosting.com/cs/

Helicopters are considerably more expensive [than fixed wing aircraft],
which is only right because they don't actually fly, but just beat
the air into submission.        - Paul Tomblin

