[Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

Wed Apr 29 00:52:22 CEST 2009

On approximately 4/28/2009 2:02 PM, came the following characters from 
the keyboard of Martin v. Löwis:
> Glenn Linderman wrote:
>> On approximately 4/28/2009 1:25 PM, came the following characters from
>> the keyboard of Martin v. Löwis:
>>>> The UTF-8b representation suffers from the same potential ambiguities as
>>>> the PUA characters... 
>>> Not at all the same ambiguities. Here, again, the two choices:
>>>
>>> A. use PUA characters to represent undecodable bytes, in particular for
>>>    UTF-8 (the PEP actually never proposed this to happen).
>>>    This introduces an ambiguity: two different files in the same
>>>    directory may decode to the same string name, if one has the PUA
>>>    character, and the other has a non-decodable byte that gets decoded
>>>    to the same PUA character.
>>>
>>> B. use UTF-8b, representing the byte with ill-formed surrogate codes.
>>>    The same ambiguity does *NOT* exist. If a file on disk already
>>>    contains an invalid surrogate code in its file name, then the UTF-8b
>>>    decoder will recognize this as invalid, and decode it byte-for-byte,
>>>    into three surrogate codes. Hence, the file names that are different
>>>    on disk are also different in memory. No ambiguity.
>> C. File on disk with the invalid surrogate code, accessed via the str
>> interface, no decoding happens, matches in memory the file on disk with
>> the byte that translates to the same surrogate, accessed via the bytes
>> interface.  Ambiguity.
> 
> Is that an alternative to A and B?

I guess it is an adjunct to case B, the current PEP.

It is what happens when using the PEP on a system that provides both 
bytes and str interfaces, and both get used.
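
To make that concrete, here is a minimal sketch. It assumes the
"surrogateescape" error handler that Python 3 provides for this PEP; the
byte strings and the helper name decode_name are made up for
illustration. It shows that case B keeps two bytes-side names distinct,
and that a name arriving through a str interface, with no decoding step,
can still equal one of them in memory.

```python
# Sketch only: assumes the "surrogateescape" error handler (PEP 383 as
# shipped in Python 3). The byte strings are made-up file names.

def decode_name(raw: bytes) -> str:
    """Decode a bytes file name the way case B describes."""
    return raw.decode("utf-8", "surrogateescape")

# Two different names as they might appear on disk:
lone_byte = b"\x80"               # a single undecodable byte
fake_surrogate = b"\xed\xb2\x80"  # UTF-8-style byte sequence for U+DC80

a = decode_name(lone_byte)
b = decode_name(fake_surrogate)

# Bytes-only view: the names stay distinct and round-trip exactly.
assert a != b
assert a.encode("utf-8", "surrogateescape") == lone_byte
assert b.encode("utf-8", "surrogateescape") == fake_surrogate
print(ascii(a), ascii(b))

# Case C: a name handed over by a str interface with no decoding,
# already containing the lone surrogate U+DC80 (hypothetical value).
from_str_api = "\udc80"
collides = (from_str_api == a)
print("str-API name equals decoded bytes name:", collides)
```

The last comparison is the situation I mean by case C: the two names
come from different files, but compare equal once both are in memory.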

On a Windows system, the ambiguous case would perhaps be the str and 
bytes APIs producing different in-memory names for the same file, when 
its name contains a (Unicode-illegal) half surrogate.  The half 
surrogate would seem to be decoded to 3 half surrogates if accessed via 
the bytes interface, but to only one via the str interface.  The version 
with 3 half surrogates could then match another name that actually 
contains 3 half surrogates and is accessed via the str interface.

I can't actually tell by reading the PEP whether it affects Windows 
bytes interfaces or is only implemented on POSIX, so that POSIX has a 
str interface.

If it is only implemented on POSIX, then the current scheme (now 
escaping the hundreds of escape codes) could work, within a single 
platform... but it would still suffer from displaying garbage (sequences 
of replacement characters) in file listings that are displayed or 
printed.  Once the string is adjusted to contain replacement characters 
for display, there is no way to distinguish one file name from another 
if they are identical except for a same-length sequence of different 
undecodable bytes.
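
A small sketch of the display problem, again assuming the
"surrogateescape" handler. The helper for_display and the two file names
are hypothetical; the helper simply swaps each escape code for U+FFFD,
which is one plausible way a listing might be prepared for printing.

```python
# Sketch only: for_display() and the file names are hypothetical.

def decode_name(raw: bytes) -> str:
    """Decode a bytes file name, escaping undecodable bytes."""
    return raw.decode("utf-8", "surrogateescape")

def for_display(name: str) -> str:
    """Replace each escape code (U+DC80..U+DCFF) with U+FFFD."""
    return "".join(
        "\ufffd" if "\udc80" <= ch <= "\udcff" else ch
        for ch in name
    )

# Two names that differ only in their undecodable bytes.
name1 = decode_name(b"report-\xe9\xe8.txt")
name2 = decode_name(b"report-\xf1\xfc.txt")

shown1 = for_display(name1)
shown2 = for_display(name2)

# In memory the names differ; in the listing they do not.
print(name1 != name2)    # True
print(shown1 == shown2)  # True
```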

The concept of a function that exposes the same decoding and encoding 
process for 3rd party interfaces is still missing from the PEP; 
implementing the PEP would require every interface to 3rd party 
software that accepts file names to be transcoded by the interface 
layer.  Or else such software would have to use the bytes interfaces 
directly, and if it does, there is no need for the PEP.
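
Roughly the kind of function I have in mind, as a sketch only. The names
to_library, from_library and third_party_open are hypothetical, the
UTF-8 default is an assumption, and the "surrogateescape" handler is
assumed to be what the PEP's decoding amounts to.

```python
# Sketch only: all three function names are hypothetical, and UTF-8 is
# assumed as the file system encoding.

def to_library(name: str, encoding: str = "utf-8") -> bytes:
    """Turn an in-memory name back into the bytes a library expects."""
    return name.encode(encoding, "surrogateescape")

def from_library(raw: bytes, encoding: str = "utf-8") -> str:
    """Turn bytes returned by a library into an in-memory name."""
    return raw.decode(encoding, "surrogateescape")

def third_party_open(path: bytes) -> bytes:
    """Stand-in for a 3rd party call that only understands bytes."""
    return path

# A made-up name containing one undecodable byte.
on_disk = b"caf\xe9.txt"
in_memory = from_library(on_disk)

# The interface layer transcodes on the way out, so the library sees
# exactly the bytes that are on disk.
seen_by_library = third_party_open(to_library(in_memory))
print(seen_by_library == on_disk)  # True
```

Every interface layer would need to apply such a pair consistently,
which is the burden described above.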

So I see the PEP as a partial solution to a limited problem: on the 
one hand it potentially produces indistinguishable sequences of 
replacement characters in filenames, rather than mojibake (which is at 
least distinguishable), and on the other hand it doesn't help software 
that also uses 3rd party libraries to avoid the bytes APIs for accessing 
file names.  There are other encodings that produce more distinguishable 
mojibake, and would work in the same situations as the PEP.

-- 
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking