[Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
v+python at g.nevcal.com
Mon Apr 27 10:40:43 CEST 2009
On approximately 4/27/2009 12:55 AM, came the following characters from
the keyboard of Cameron Simpson:
> On 26Apr2009 23:39, Glenn Linderman <v+python at g.nevcal.com> wrote:
>> There are still issues regarding how Windows and POSIX programs that are
>> sharing cross-mounted file systems might communicate file names between
>> each other, which is not at all clear from the PEP. If this is an
>> insoluble or un-addressed issue, it should be stated. (It is probably
>> insoluble, due to there being multiple ways that the cross-mounted file
>> systems might translate names; but if there are, can we learn something
>> from the rules the mounting systems use, to be compatible with (one of)
>> them, or not?)
> I'd say that's out of scope. A Windows filesystem mounted on a UNIX host
> should probably be mounted with a mapping to translate the Windows
> Unicode names into whatever the sysadmin deems the locally most apt
> byte encoding. But sys.getfilesystemencoding() is based on the current user's
> locale settings, which need not be the same.
And if it were, what would it do with files that can't be encoded with
the locally most apt byte encoding? That's where we might learn
something about what behaviors are deemed acceptable. Would such files
be inaccessible? Accessible with mangled names? Or what?
And for a Unix filesystem mounted on a Windows host? Or accessed via
some network connection?
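
To make those three outcomes concrete, here is a minimal sketch using Python's own codecs. It does not claim to show what any real mounting system does; the file name, the choice of Latin-1 as the "locally most apt" encoding, and the helper encode_name are all assumptions made for illustration.

```python
# Hypothetical file name with characters that Latin-1 cannot represent.
name = "r\u00e9sum\u00e9_\u65e5\u672c.txt"

def encode_name(name, encoding, errors):
    """Return the encoded bytes, or None if the name cannot be encoded."""
    try:
        return name.encode(encoding, errors)
    except UnicodeEncodeError:
        return None

# Outcome 1: the file is simply inaccessible (no byte name exists for it).
inaccessible = encode_name(name, "latin-1", "strict")

# Outcome 2: accessible, but with a lossy, mangled name.
mangled = encode_name(name, "latin-1", "replace")

# Outcome 3: accessible through a visible escape that keeps the code points.
escaped = encode_name(name, "latin-1", "backslashreplace")

print(inaccessible, mangled, escaped)
```

Note that the lossy variant maps every unencodable character to the same ?, so two different original names can collide.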
>> Together with your change to avoid using PUA characters, the rule
>> suggested by MRAB in another branch of this thread, of treating
>> half-surrogates as invalid byte sequences, may avoid the data puns I'm
>> concerned about.
>> It is not clear how half-surrogate characters would be displayed, when
>> the user prints or displays such a file name string. It would seem that
>> programs that display file names to users might still have issues with
>> such; an escaping mechanism that uses displayable characters would have
>> an advantage there.
> Wouldn't any escaping mechanism that uses displayable characters
> require visually mangling occurrences of those characters that
> legitimately occur in the original?
Yes. The ? I suggested is a visible character that is illegal in
Windows file names, so no valid Windows file name would be visually
mangled. It is also a character that should be avoided in
POSIX names because:
1) it is known to be illegal on Windows, and thus non-portable
2) it is hard to write globs that match ? without allowing matches of
other characters as well
3) it must be quoted to specify it on a command line
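
Point 2 can be illustrated with Python's fnmatch module, which follows shell-style glob rules. This is only a sketch; the file names are made up for the example.

```python
import fnmatch
import glob

# Hypothetical directory listing; only the first name contains a literal ?.
names = ["what?.txt", "whatX.txt", "what.txt"]

# A bare ? in a pattern is a wildcard for any single character,
# so it matches the literal ? and the X alike.
loose = fnmatch.filter(names, "what?.txt")

# Matching only the literal ? requires a bracket expression.
exact = fnmatch.filter(names, "what[?].txt")

# glob.escape (Python 3.4+) builds that bracketed pattern automatically.
escaped_pattern = glob.escape("what?.txt")

print(loose, exact, escaped_pattern)
```

The bracketed form works, but it is exactly the kind of extra care that makes ? awkward to have in a file name.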
That said, someone provided a case where it is "easy" to get ? in POSIX
file names. The remaining question is whether that is a reasonable use
case, a frequent use case, or a stupid use case; and whether the
resulting visible mangling is more or less understandable and disruptive
than using half-surrogates which are:
1) invalid Unicode
3) indistinguishable using normal non-displayable character substitution
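
For comparison, here is a minimal sketch of the half-surrogate approach as it behaves in Python 3.1 and later, where it is exposed as the "surrogateescape" error handler. That handler name comes from the released PEP, not from this message, and the byte strings are made-up examples.

```python
# Hypothetical file name in Latin-1 bytes; 0xE9 is not valid UTF-8 here.
raw = b"caf\xe9.txt"

# The undecodable byte 0xE9 becomes the lone surrogate U+DCE9.
name = raw.decode("utf-8", "surrogateescape")

# Encoding with the same handler restores the original bytes.
round_trip = name.encode("utf-8", "surrogateescape")

# The string is not valid Unicode: a strict encode refuses it.
try:
    name.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# Displaying it needs a visible escape of some kind anyway.
shown = name.encode("ascii", "backslashreplace").decode("ascii")

# Bytes that look like UTF-8 for U+DCE9 are rejected and escaped
# byte by byte, so they cannot be confused with an escaped 0xE9.
pun = b"\xed\xb3\xa9".decode("utf-8", "surrogateescape")

print(ascii(name), round_trip, strict_ok, shown, ascii(pun))
```

The last case is the rule of treating half-surrogates as invalid byte sequences: the two inputs decode to different strings, and each one round-trips to its own bytes.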
Glenn -- http://nevcal.com/
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking