[Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

Tue Apr 28 03:15:17 CEST 2009

On approximately 4/27/2009 2:14 PM, came the following characters from 
the keyboard of Cameron Simpson:
> On 27Apr2009 00:07, Glenn Linderman <v+python at g.nevcal.com> wrote:
>   
>> On approximately 4/25/2009 5:22 AM, came the following characters from  
>> the keyboard of Martin v. Löwis:
>>     
>>>> The problem with this, and other preceding schemes that have been
>>>> discussed here, is that there is no means of ascertaining whether a
>>>> particular file name str was obtained from a str API, or was funny-
>>>> decoded from a bytes API... and thus, there is no means of reliably
>>>> ascertaining whether a particular filename str should be passed to a
>>>> str API, or funny-encoded back to bytes.
>>>>         
>>> Why is it necessary that you are able to make this distinction?
>>>       
>> It is necessary that programs (not me) can make the distinction, so that  
>> they know whether or not to do the funny-encoding.
>>     
>
> I would say this isn't so. It's important that programs know if they're
> dealing with strings-for-filenames, but not that they be able to figure
> that out "a priori" if handed a bare string (especially since they
> can't:-)
>   

So you agree they can't... that there are data puns.   (OK, you may not 
have thought that through)

>> If a name is  
>> funny-decoded when the name is accessed by a directory listing, it needs  
>> to be funny-encoded in order to open the file.
>>     
>
> Hmm. I had thought that legitimate unicode strings already get transcoded
> to bytes via the mapping specified by sys.getfilesystemencoding()
> (the user's locale). That already happens I believe, and Martin's
> scheme doesn't change this. He's just funny-encoding non-decodable byte
> sequences, not the decoded stuff that surrounds them.
>   

So assume a non-decodable sequence in a name.  That puts us into 
Martin's funny-decode scheme.  His funny-decode scheme produces a bare 
string, indistinguishable from a bare string that would be produced by a 
str API that happens to contain that same sequence.  Data puns.

So when open is handed the string, should it open the file with the name 
that matches the string, or the file with the name that funny-decodes to 
the same string?  It can't know, unless it knows that the string is a 
funny-decoded string or not.
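
To make the pun concrete, here is a minimal sketch.  The decoder below is 
a deliberately naive stand-in, not the mapping from the PEP: the handler 
name "toy-pua", the function naive_funny_decode and the byte-to-U+F7NN 
mapping are all assumptions made up for illustration.  It replaces each 
undecodable byte with a private-use character and escapes nothing, so 
two different byte names land on the same str.

```python
import codecs

PUA_BASE = 0xF700  # hypothetical mapping: byte 0xNN -> U+F7NN (not from the PEP)


def _toy_pua_handler(exc):
    # Replace every undecodable byte with a private-use character.
    # Nothing is escaped, which is exactly the weakness being shown.
    if isinstance(exc, UnicodeDecodeError):
        bad = exc.object[exc.start:exc.end]
        return "".join(chr(PUA_BASE + b) for b in bad), exc.end
    raise exc


codecs.register_error("toy-pua", _toy_pua_handler)


def naive_funny_decode(raw: bytes) -> str:
    """Decode a byte file name as UTF-8, funny-decoding the bad bytes."""
    return raw.decode("utf-8", "toy-pua")


name_a = b"\xff"                    # not valid UTF-8, gets funny-decoded
name_b = "\uf7ff".encode("utf-8")   # valid UTF-8 for the character U+F7FF

# Two distinct names on disk, one str in the program: a data pun.
pun = naive_funny_decode(name_a) == naive_funny_decode(name_b)
```

Given only the resulting str, open() has no way to tell whether name_a or 
name_b was meant.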

> So it is already the case that strings get decoded to bytes by
> calls like open(). Martin isn't changing that.
>   

I thought the process of converting strings to bytes is called 
encoding.  You seem to be calling it decoding?

> I suppose if your program carefully constructs a unicode string riddled
> with half-surrogates etc and imagines something specific should happen
> to them on the way to being POSIX bytes then you might have a problem...
>   

Right.  Or someone else's program does that.  I only want to use Unicode 
file names.  But if those other file names exist, I want to be able to 
access them, and not accidentally get a different file.

> I think the advantage to Martin's choice of encoding-for-undecodable-bytes
> is that it _doesn't_ use normal characters for the special bits. This
> means that _all_ normal characters are left unmangled in both "bare"
> and "funny-encoded" strings.
>   

Whether the characters used for funny decoding are normal or abnormal, 
unless they are prevented from also appearing in filenames when they are 
obtained from or passed to other APIs, there is the possibility that a 
file whose real name is identical to the funny-decoded string also exists 
in the filesystem... a data pun on the name.

Whether the characters used for funny decoding are normal or abnormal, 
if they are not prevented from also appearing in filenames when they are 
obtained from or passed to other APIs, then in order to prevent data 
puns, *all* names must be passed through the decoder, and the decoder 
must perform a 1-to-1 reversible mapping.  Martin's funny-decode process 
does not perform a 1-to-1 reversible mapping (unless he's changed it 
from the version of the PEP I found to read).

This is why some people have suggested using the null character for the 
decoding, because it and / can't appear in POSIX file names, but 
everything else can.  But that makes it really hard to display the 
funny-decoded characters.

> Because of that, I now think I'm -1 on your "use printable characters
> for the encoding". I think presentation of the special characters
> _should_ look bogus in an app (eg little rectangles or whatever in a
> GUI); it's a fine flashing red light to the user.
>   

The reason I picked an ASCII printable character is just to make it 
easier for humans to see the encoding.  The scheme would also work with 
a non-ASCII non-printable character... but I fail to see how that would 
help a human compare the strings on a display of file names.  Having a 
bunch of abnormal characters in a row, displayed using a single 
replacement glyph, just makes an annoying mess in the file open dialog.

> Also, by avoiding reuse of legitimate characters in the encoding we can
> avoid your issue with losing track of where a string came from;
> legitimate characters are currently untouched by Martin's scheme, except
> for the normal "bytes<->string via the user's locale" translation that
> must already happen, and there you're aided by bytes and strings being
> different types.
>   

There are abnormal characters, but there are no illegal characters.  
NTFS permits any 16-bit "character" code, including abnormal ones, 
including half-surrogates, and including full surrogate sequences that 
decode to PUA characters.  POSIX permits all byte sequences, including 
things that look like UTF-8, things that don't look like UTF-8, things 
that look like half-surrogates, and things that look like full surrogate 
sequences that decode to PUA characters.

So whether the decoding/encoding scheme uses common characters, or 
uncommon characters, you still have the issue of data puns, unless you 
use a 1-to-1 transformation, that is reversible.  With ASCII strings, I 
think no one questions that you need to escape the escape characters.  C 
uses \ as an escape character... Everyone understands that if you want 
to use a \ in a C string, you have to use \\ instead... and that scheme 
has escaped the boundaries of C to other use cases.  But it seems that 
you think that if we could just find one more character that no one else 
uses, that we wouldn't have to escape it.... and that could be true, but 
there aren't any characters that no one else uses.  So whatever 
character (and a range makes it worse) you pick, someone else uses it.  
So in order for the scheme to work, you have to escape the escape 
character(s), even in names that wouldn't otherwise need to be 
funny-decoded.

>> I'm certainly not experienced enough in Python development processes or  
>> internals to attempt such, as yet.  But somewhere in 25 years of  
>> programming, I picked up the knowledge that if you want to have a 1-to-1  
>> reversible mapping, you have to avoid data puns, mappings of two  
>> different data values into a single data value.  Your PEP, as first  
>> written, didn't seem to do that... since there are two interfaces from  
>> which to obtain data values, one performing a mapping from bytes to  
>> "funny invalid" Unicode, and the other performing no mapping, but  
>> accepting any sort of Unicode, possibly including "funny invalid"  
>> Unicode, the possibility of data puns seems to exist.  I may be  
>> misunderstanding something about the use cases that prevent these two  
>> sources of "funny invalid" Unicode from ever coexisting, but if so,  
>> perhaps you could point it out, or clarify the PEP.
>>     
>
> Please elucidate the "second source" of strings. I'm presuming you mean
> strings generated from scratch rather than obtained by something like
> listdir().
>   

POSIX has byte APIs for strings, that's one source, that is most under 
discussion.  Windows has both bytes and 16-bit APIs for strings... the 
16-bit APIs are generally mapped directly to UTF-16, but are not checked 
for UTF-16 validity, so all of Martin's funny-decoded strings could be 
used as Windows file names on the 16-bit APIs.  And yes, strings can be 
generated from scratch.

> Given such a string with "funny invalid" stuff in it, and _absent_
> Martin's scheme, what do you expect the source of the strings to _expect_
> to happen to them if passed to open()? They still have to be converted
> to bytes at the POSIX layer anyway.

There is a fine encoding scheme that can take any str and encode to 
bytes: UTF-8.

The problem is that UTF-8 doesn't work to take any byte sequence and 
decode to str, and that means that special handling has to happen when 
such byte sequences are encountered.  But there is no str the special 
handling can generate that can't also be generated in other ways, and 
that str would properly be encoded to a different byte sequence.  Hence 
there are data puns, no 1-to-1 mapping.  Hence it seems obvious to me 
that the only complete 
solution is to have an escape character, and ensure that all strings are 
decoded and encoded.  As soon as you have an escape character, then you 
can decode anything into displayable, standard, Unicode, and you can 
create the reverse encoding unambiguously.
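
The asymmetry is easy to check.  A small sketch, assuming Python 3's 
strict UTF-8 codec; the helper name can_decode_utf8 and the sample names 
are made up for illustration:

```python
def can_decode_utf8(raw: bytes) -> bool:
    """Return True if raw is a well-formed UTF-8 byte sequence."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


samples = [
    b"report.txt",               # plain ASCII
    "caf\u00e9".encode("utf-8"), # text that went through UTF-8 encoding
    b"caf\xe9",                  # the same name written in Latin-1
    b"\xff\xfe",                 # bytes that never occur in UTF-8
]

# Maps each byte name to whether a strict decoder accepts it.
results = {raw: can_decode_utf8(raw) for raw in samples}
```

The first two names decode; the last two are perfectly legal POSIX file 
names that a strict decoder rejects, and those are the ones that need the 
special handling.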

Without an escape character, you just have a heuristic that will work 
sometimes, and break sometimes.  If you believe non-UTF-8-decodable byte 
sequences are rare, you can ignore them.  That's what we do now, but 
people squawk.  If you believe that you can invent an encoding that has 
data puns, and that because the character or characters involved are 
rare, the problems that result can be ignored, fine... but people 
will squawk when they hit the problem... I'm just trying to squawk now, 
to point out that this is complexity for complexity's sake; it adds 
complexity to trade one problem for a different problem, under the 
belief that the other problem is somehow rarer than the first.  And 
maybe it is, today.  I'd much rather have a solution that actually 
solves the problem.

If you don't like ? as the escape character, then pick U+10F01, and 
anytime a U+10F01 is encountered in a file name, double it.  And anytime 
there is an undecodable byte sequence, emit U+10F01, and then U+80 
through U+FF as a subsequent character for the first byte in the 
undecodable sequence, and restart the decoder with the next byte.  
That'll work too.  But use of rare, abnormal characters to take the 
place of undecodable bytes can never work, because of data puns, and 
valid use of the rare, abnormal characters.
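
A minimal sketch of that U+10F01 scheme, to show that it round-trips.  
The function names escape_decode and escape_encode are hypothetical.  As 
an implementation shortcut (an assumption, not part of the scheme), the 
decoder uses Python 3's "surrogateescape" error handler purely to locate 
the undecodable bytes, then rewrites them into the escaped form.

```python
ESC = "\U00010F01"  # the escape character suggested above


def escape_decode(raw: bytes) -> str:
    """bytes -> str; distinct byte names always give distinct strs."""
    # Shortcut (assumption): "surrogateescape" marks each undecodable
    # byte 0xNN as the lone surrogate U+DCNN, which strict UTF-8
    # decoding can never produce on its own.
    marked = raw.decode("utf-8", "surrogateescape")
    out = []
    for ch in marked:
        cp = ord(ch)
        if ch == ESC:
            out.append(ESC + ESC)               # double a literal escape
        elif 0xDC80 <= cp <= 0xDCFF:
            out.append(ESC + chr(cp - 0xDC00))  # ESC + U+0080..U+00FF
        else:
            out.append(ch)
    return "".join(out)


def escape_encode(name: str) -> bytes:
    """str -> bytes; exact inverse of escape_decode."""
    out = bytearray()
    i = 0
    while i < len(name):
        ch = name[i]
        if ch != ESC:
            out += ch.encode("utf-8")
            i += 1
            continue
        if i + 1 >= len(name):
            raise ValueError("dangling escape character")
        nxt = name[i + 1]
        if nxt == ESC:
            out += ESC.encode("utf-8")          # escaped literal U+10F01
        elif 0x80 <= ord(nxt) <= 0xFF:
            out.append(ord(nxt))                # escaped raw byte
        else:
            raise ValueError("invalid escape sequence")
        i += 2
    return bytes(out)


samples = [
    b"plain.txt",
    b"caf\xc3\xa9",                       # valid UTF-8
    b"bad\xff\xfename",                   # undecodable bytes
    ESC.encode("utf-8") + b"\xc2\x80",    # a real U+10F01 in the name
]
round_trips = all(escape_encode(escape_decode(r)) == r for r in samples)
```

The point of the doubling is visible in the last sample: a name that 
really contains U+10F01 followed by U+0080 decodes to a different str 
than the lone byte 0x80 does, so no two byte names collide.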

Someone suggested treating the byte sequences of the rare, abnormal 
characters as undecodable bytes, and decoding them using the same 
substitution rules.  That would work too, if applied consistently, 
because then the rare, abnormal characters would each be escaped.  But 
having 128 escape characters seems more complex than necessary, also.
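
For comparison, a sketch of that last variant, assuming a Python 3 
interpreter that provides the "surrogateescape" error handler (which 
uses the 128 lone surrogates U+DC80 through U+DCFF as its escape 
characters).  Its strict UTF-8 decoder rejects the byte sequences that 
would spell those surrogates, so they get escaped byte by byte like any 
other undecodable input.  The variable and function names are 
illustrative.

```python
def roundtrip(raw: bytes) -> bytes:
    """Decode and re-encode a byte name with the surrogateescape handler."""
    text = raw.decode("utf-8", "surrogateescape")
    return text.encode("utf-8", "surrogateescape")


stray = b"\xff"               # an undecodable byte
lookalike = b"\xed\xb3\xbf"   # the bytes a lax encoder would emit for U+DCFF

stray_text = stray.decode("utf-8", "surrogateescape")
lookalike_text = lookalike.decode("utf-8", "surrogateescape")

# The two byte names stay distinct as strs, and both survive a round trip.
no_pun = stray_text != lookalike_text
both_survive = roundtrip(stray) == stray and roundtrip(lookalike) == lookalike
```

This only settles the bytes-to-str-to-bytes direction.  A str built from 
scratch, or obtained from a 16-bit API, that already contains one of the 
escape characters is still indistinguishable from a funny-decoded one.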

-- 
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking