[Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

Glenn Linderman v+python at g.nevcal.com
Mon Apr 27 10:40:43 CEST 2009

On approximately 4/27/2009 12:55 AM, came the following characters from 
the keyboard of Cameron Simpson:
> On 26Apr2009 23:39, Glenn Linderman <v+python at g.nevcal.com> wrote:
> [...snip...]
>> There are still issues regarding how Windows and POSIX programs that are  
>> sharing cross-mounted file systems might communicate file names between  
>> each other, which is not at all clear from the PEP.  If this is an  
>> insoluble or un-addressed issue, it should be stated.  (It is probably  
>> insoluble, due to there being multiple ways that the cross-mounted file  
>> systems might translate names; but if there are, can we learn something  
>> from the rules the mounting systems use, to be compatible with (one of)  
>> them, or not?)
> I'd say that's out of scope. A windows filesystem mounted on a UNIX host
> should probably be mounted with a mapping to translate the Windows
> Unicode names into whatever the sysadmin deems the locally most apt
> byte encoding. But sys.getfilesystemencoding() is based on the current user's
> locale settings, which need not be the same.

And if it were, what would it do with files that can't be encoded with 
the locally most apt byte encoding?  That's where we might learn 
something about what behaviors are deemed acceptable.  Would such files 
be inaccessible?  Accessible with mangled names?  Or what?

And for a Unix filesystem mounted on a Windows host?  Or accessed via 
some network connection?
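
As a rough illustration of the "inaccessible or mangled" question, here is a
minimal Python sketch.  It does not model any real mount layer; it only uses
the standard str.encode error handlers as stand-ins for the choices such a
layer could make.  The file name, the helper encode_outcomes, and the choice
of Latin-1 as the "locally most apt" encoding are all assumptions made for
the example.

```python
# Hypothetical file name containing characters that Latin-1 cannot represent.
name = "r\u00e9sum\u00e9_\u65e5\u672c.txt"


def encode_outcomes(name, encoding):
    """Show what each standard error handler does with `name`."""
    outcomes = {}
    try:
        outcomes["strict"] = name.encode(encoding)
    except UnicodeEncodeError:
        # Comparable to the file being inaccessible under this encoding.
        outcomes["strict"] = None
    # Comparable to a mangled name: each unencodable character becomes "?".
    outcomes["replace"] = name.encode(encoding, "replace")
    # Another mangling: unencodable characters are silently dropped.
    outcomes["ignore"] = name.encode(encoding, "ignore")
    return outcomes


outcomes = encode_outcomes(name, "latin-1")
print(outcomes)
```

Both mangled forms lose information, so two distinct original names can end
up with the same encoded name.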

>> Together with your change to avoid using PUA characters, the rule
>> suggested by MRAB in another branch of this thread, of treating
>> half-surrogates as invalid byte sequences, may avoid the data puns I'm
>> concerned about.
>> It is not clear how half-surrogate characters would be displayed, when  
>> the user prints or displays such a file name string.  It would seem that  
>> programs that display file names to users might still have issues with  
>> such; an escaping mechanism that uses displayable characters would have  
>> an advantage there.
> Wouldn't any escaping mechanism that uses displayable characters
> require visually mangling occurrences of those characters that
> legitimately occur in the original?

Yes.  The character I suggested, ?, is a visible character that is illegal
in Windows file names, so no valid Windows file name would be visually
mangled.  It is also a character that should be avoided in POSIX names
because:

1) it is known to be illegal on Windows, and thus non-portable
2) it is hard to write globs that match ? without allowing matches of 
other characters as well
3) it must be quoted to specify it on a command line
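
Point 2 can be seen with Python's fnmatch module, used here as a stand-in
for shell globbing.  This is only a sketch; the three file names are made up
for the example.  A bare ? in a pattern matches any single character, so
matching a literal ? takes a bracket expression.

```python
import fnmatch

# Hypothetical directory listing; only the first name contains a literal "?".
names = ["a?b", "axb", "a.b"]


def match(pattern):
    # fnmatchcase avoids platform-dependent case normalization.
    return [n for n in names if fnmatch.fnmatchcase(n, pattern)]


loose = match("a?b")      # "?" is a wildcard: matches every name above
literal = match("a[?]b")  # bracket expression: matches only the literal "?"

print(loose)
print(literal)
```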

That said, someone provided a case where it is "easy" to get ? in POSIX 
file names.  The remaining question is whether that is a reasonable use 
case, a frequent use case, or a stupid use case; and whether the 
resulting visible mangling is more or less understandable and disruptive 
than using half-surrogates, which are:

1) invalid Unicode
2) non-displayable
3) indistinguishable using normal non-displayable character substitution 
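
The three points can be sketched in Python.  This assumes the
"surrogateescape" error handler, which is how the PEP's proposal was later
exposed in Python 3.1 and newer; it is not something this message defines.
The byte strings are invented for the example, and the "replace" handler
stands in for whatever substitution a display layer might apply.

```python
# Two hypothetical Latin-1 encoded names; neither is valid UTF-8.
raw_a = b"caf\xe9.txt"
raw_b = b"caf\xe8.txt"

# Undecodable bytes become lone low surrogates (0xE9 -> U+DCE9).
name_a = raw_a.decode("utf-8", "surrogateescape")
name_b = raw_b.decode("utf-8", "surrogateescape")

# 1) Invalid Unicode: a strict UTF-8 encoder rejects the lone surrogate.
try:
    name_a.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# The original bytes are still recoverable with the same handler.
roundtrip = name_a.encode("utf-8", "surrogateescape")

# 2) and 3) Non-displayable and indistinguishable: a generic substitution
# maps both names to the same visible string.
shown_a = name_a.encode("utf-8", "replace").decode("utf-8")
shown_b = name_b.encode("utf-8", "replace").decode("utf-8")

print(repr(name_a), strict_ok, roundtrip, shown_a, shown_b)
```

The names stay distinct as Python strings; only the substituted display
form collapses them.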

Glenn -- http://nevcal.com/
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
