[Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
Glenn Linderman
v+python at g.nevcal.com
Wed Apr 29 00:52:22 CEST 2009
On approximately 4/28/2009 2:02 PM, came the following characters from
the keyboard of Martin v. Löwis:
> Glenn Linderman wrote:
>> On approximately 4/28/2009 1:25 PM, came the following characters from
>> the keyboard of Martin v. Löwis:
>>>> The UTF-8b representation suffers from the same potential ambiguities as
>>>> the PUA characters...
>>> Not at all the same ambiguities. Here, again, the two choices:
>>>
>>> A. use PUA characters to represent undecodable bytes, in particular for
>>> UTF-8 (the PEP actually never proposed this to happen).
>>> This introduces an ambiguity: two different files in the same
>>> directory may decode to the same string name, if one has the PUA
>>> character, and the other has a non-decodable byte that gets decoded
>>> to the same PUA character.
>>>
>>> B. use UTF-8b, representing the byte with ill-formed surrogate codes.
>>> The same ambiguity does *NOT* exist. If a file on disk already
>>> contains an invalid surrogate code in its file name, then the UTF-8b
>>> decoder will recognize this as invalid, and decode it byte-for-byte,
>>> into three surrogate codes. Hence, the file names that are different
>>> on disk are also different in memory. No ambiguity.
>> C. File on disk with the invalid surrogate code, accessed via the str
>> interface, no decoding happens, matches in memory the file on disk with
>> the byte that translates to the same surrogate, accessed via the bytes
>> interface. Ambiguity.
>
> Is that an alternative to A and B?
I guess it is an adjunct to case B, the current PEP.
It is what happens when using the PEP on a system that provides both
bytes and str interfaces, and both get used.
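A minimal sketch of case C, assuming the error handler behaves like the
`surrogateescape` handler that later shipped in Python 3.1 (the PEP was
still a draft when this was written). The file names are hypothetical,
and the "str interface" result is simulated with a literal, since no real
directory is read:

```python
# Hypothetical name as a native str interface would return it: the file
# name on disk really contains the lone surrogate U+DC80.
name_via_str_api = "file\udc80"

# Hypothetical second file, seen through a bytes interface: its name ends
# in the undecodable byte 0x80, which the handler maps to U+DC80.
raw_name = b"file\x80"
name_via_bytes_api = raw_name.decode("utf-8", "surrogateescape")

# Two different files on disk, one string in memory.
print(name_via_str_api == name_via_bytes_api)
```

The comparison prints True, which is the collision described above: the
two names are only told apart if the program remembers which interface
each one came from.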
On a Windows system, the ambiguous case might be the str and bytes APIs
producing different in-memory names for the same file, when its name
contains a (Unicode-illegal) half surrogate. The half surrogate would
seem to be decoded to 3 half surrogates when accessed via the bytes
interface, but to only one via the str interface. The version with 3
half surrogates could then match a different name that actually contains
3 half surrogates and is accessed via the str interface.
I can't actually tell by reading the PEP whether it also affects the
Windows bytes interfaces, or is implemented only on POSIX, so that POSIX
gains a str interface.
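The Windows scenario can be simulated in pure Python, with the caveat
that the PEP may not touch the Windows bytes interfaces at all. This
sketch assumes the bytes interface would hand back the UTF-8 form of the
lone surrogate (produced here with the `surrogatepass` handler) and that
the result is then decoded with `surrogateescape` as it later shipped;
both assumptions are for illustration only:

```python
# A file name made of one lone (half) surrogate, as the str interface sees it.
name_via_str_api = "\udc80"

# Assumed bytes-interface view of the same name: the 3-byte UTF-8 form.
raw_name = name_via_str_api.encode("utf-8", "surrogatepass")

# UTF-8b style decoding rejects the sequence and escapes each byte separately.
name_via_bytes_api = raw_name.decode("utf-8", "surrogateescape")

print(ascii(raw_name))                                  # the three raw bytes
print(len(name_via_str_api), len(name_via_bytes_api))   # 1 versus 3 code points
print(name_via_str_api == name_via_bytes_api)           # same file, different names
```

Under these assumptions the same file gets a one-character name through
one interface and a three-character name through the other, and the
longer one is equal to a str-interface name that really does consist of
those three half surrogates.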
If it is implemented only on POSIX, then the current scheme (now
escaping the hundreds of escape codes) could work within a single
platform... but it would still suffer from displaying garbage (sequences
of replacement characters) in file listings that are displayed or
printed. Once the string has been adjusted to contain replacement
characters for display, there is no way to distinguish one file name
from another if they are identical except for a same-length sequence of
different undecodable bytes.
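The display problem can be sketched as follows, assuming the
`surrogateescape` handler as it later shipped and a hypothetical
`for_display` helper that swaps each escaped byte for U+FFFD before
printing. The file names are made up for the example:

```python
REPLACEMENT = "\ufffd"

def for_display(name: str) -> str:
    """Hypothetical helper: swap escaped bytes (U+DC80..U+DCFF) for U+FFFD."""
    return "".join(
        REPLACEMENT if "\udc80" <= ch <= "\udcff" else ch for ch in name
    )

# Two hypothetical file names that differ only in their undecodable bytes.
name_a = b"report-\xff\xfe.txt".decode("utf-8", "surrogateescape")
name_b = b"report-\xfe\xff.txt".decode("utf-8", "surrogateescape")

print(name_a == name_b)                            # distinct in memory
print(for_display(name_a) == for_display(name_b))  # identical in a listing
```

The in-memory names stay distinct, so opening either file still works;
only the displayed form collapses to the same string.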
The concept of a function that exposes the same decoding and encoding
process for 3rd party interfaces is still missing from the PEP.
Implementing the PEP would require every interface to 3rd party software
that accepts file names to transcode them in the interface layer.
Otherwise such software would have to use the bytes interfaces directly,
and if it does, there is no need for the PEP.
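A sketch of the kind of transcoding pair this paragraph asks for,
assuming UTF-8 as the file system encoding and the `surrogateescape`
handler as it later shipped. The helper names and the stand-in library
function are hypothetical; Python 3.2 later added `os.fsencode()` and
`os.fsdecode()` to fill this role:

```python
FS_ENCODING = "utf-8"  # assumed; a real program would ask the platform

def to_library_bytes(name: str) -> bytes:
    """Hypothetical helper: recover the original bytes for a bytes-only library."""
    return name.encode(FS_ENCODING, "surrogateescape")

def from_library_bytes(raw: bytes) -> str:
    """Hypothetical helper: decode a file name returned by such a library."""
    return raw.decode(FS_ENCODING, "surrogateescape")

def third_party_echo(raw: bytes) -> bytes:
    """Stand-in for a 3rd party call that takes and returns a byte file name."""
    return raw

original = b"caf\xe9.txt"  # Latin-1 bytes, not valid UTF-8
name = from_library_bytes(original)

# str -> bytes at the library boundary, bytes -> str on the way back.
returned = from_library_bytes(third_party_echo(to_library_bytes(name)))
print(to_library_bytes(returned) == original)
```

Every call site that crosses into the library needs this wrapping, which
is the interface-layer cost described above.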
So I see the PEP as a partial solution to a limited problem. On the one
hand, it potentially produces indistinguishable sequences of replacement
characters in file names, rather than mojibake (which is at least
distinguishable). On the other hand, it doesn't help software that also
uses 3rd party libraries to avoid the bytes APIs for accessing file
names. There are other encodings that produce more distinguishable
mojibake, and would work in the same situations as the PEP.
--
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking