[Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

Sat Apr 25 19:27:47 CEST 2009

Martin v. Löwis wrote:
>> I see two main user-oriented use cases for the resulting Unicode
>> strings this PEP will produce on all systems: displaying a list of
>> filenames for the user to select from (an open file dialog), and
>> allowing a user to edit or supply a filename (a save dialog or a
>> rename control).
> 
> There are more, in particular the case "user passes a file name
> on the command line", and "web server passes URL in environment
> variable".
> 
>> It's clear what this PEP provides for the former. On well-behaved
>> systems where a simpler filesystemencoding approach would work, the
>> results are identical; the user can select filenames that are what he
>> expects to see on both Unix and Windows. On less well-behaved systems,
>> some characters may appear as junk in the middle of the name (or would
>> they be invisible?)
> 
> Depends on the rendering. Try "print u'\udc00'" in your terminal to see
> what happens; for me, it renders the glyph for "replacement character".
> In GUI applications, you often see white boxes (rectangles).
> 
>> What I don't find clear is what the risks are for the latter. On the
>> less well behaved system, a user may well attempt to use this python
>> application to fix filenames. Can we estimate a likelihood that edits
>> to the names would result in a Unicode string that can no longer be
>> encoded with the python-escape? Will a new name fully provided by a
>> user on his keyboard (ignoring copy and paste) almost always safely
>> encode?
> 
> That very much depends on the system setup, and your impression is
> right that the PEP doesn't address it - it only deals with cases
> where you get random unsupported bytes; getting random unsupported
> characters from the user is not considered.
> 
> If the user has the locale set up in a way that matches his keyboard,
> it should all work fine - and will already, even without the PEP.
> If the user enters a character that doesn't directly map to a
> good file name, you get an exception, and have to tell the user
> to pick a different filename.
> 
> Notice that it may fail at several layers:
> - it may be that characters entered are not supported in what
>   Python chooses as the file system encoding.
> - it may be that the characters are not supported by the file
>   system, e.g. leading spaces in Win32.
> - it may be that the file cannot be renamed because the target
>   name already exists.
> In all these cases, the application has to ask the user to
> reconsider; for at least the last case, it should be prepared
> to do that, anyway (there is also the case where renaming fails
> because of lack of permissions; in that case, picking a different
> file name won't help).
> 
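For what it's worth, the layers listed above could be handled roughly as
follows. This is only a sketch, assuming a current Python 3; the helper
name try_rename and the strings it returns are made up for illustration.
The existence check is also racy and is there only to show the idea.

```python
import os


def try_rename(src: str, dst: str) -> str:
    """Return "ok", or a short reason why the user should be asked again.

    Hypothetical helper: the name and the return strings are illustrative.
    """
    # Layer 1: the name must be encodable in the file system encoding.
    try:
        os.fsencode(dst)
    except UnicodeEncodeError:
        return "name not encodable"

    # Layer 3: the target may already exist (racy check, sketch only).
    if os.path.lexists(dst):
        return "target exists"

    try:
        os.rename(src, dst)
    except PermissionError:
        # Picking a different name will not help here.
        return "permission denied"
    except OSError:
        # Layer 2: the file system itself rejected the name.
        return "rejected by file system"
    return "ok"
```

An application would call this in a loop and prompt again for every
result other than "ok" and "permission denied".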
This has made me think about what happens going the other way, i.e. when a
user-supplied Unicode string needs to be converted to UTF-8b. That
should also be reversible.

Therefore:

When encoding using UTF-8b, codepoints in the range U+DC80..U+DCFF
should map to bytes 0x80..0xFF; all other codepoints, including the
remaining half surrogates, should be encoded normally.

When decoding using UTF-8b, undecodable bytes in the range 0x80..0xFF
should map to U+DC80..U+DCFF; all other bytes, including the encodings
for the remaining half surrogates, should be decoded normally.

This will ensure that even when the user has provided a string
containing half surrogates it can be encoded to bytes and then decoded
back to the original string.
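
A rough sketch of the two rules above, assuming a current Python 3. The
function names and the error-handler name "utf8b-sketch" are made up for
illustration; this is not how any existing codec is implemented.

```python
import codecs


def utf8b_encode(text: str) -> bytes:
    """Encode per the rule above: U+DC80..U+DCFF become bytes 0x80..0xFF."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:
            out.append(cp - 0xDC00)
        else:
            # Everything else, including other half surrogates, is
            # encoded normally ("surrogatepass" allows lone surrogates).
            out += ch.encode("utf-8", "surrogatepass")
    return bytes(out)


def _utf8b_errors(exc):
    """Decode error handler (hypothetical) implementing the rule above."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    data = exc.object
    chunk = data[exc.start:exc.start + 3]
    # A three-byte encoding of a half surrogate: ED A0..BF 80..BF.
    if (len(chunk) == 3 and chunk[0] == 0xED
            and 0xA0 <= chunk[1] <= 0xBF and 0x80 <= chunk[2] <= 0xBF):
        cp = ((chunk[0] & 0x0F) << 12) | ((chunk[1] & 0x3F) << 6) \
            | (chunk[2] & 0x3F)
        return chr(cp), exc.start + 3
    byte = data[exc.start]
    if byte < 0x80:
        raise exc
    # Undecodable byte 0x80..0xFF maps to U+DC80..U+DCFF.
    return chr(0xDC00 + byte), exc.start + 1


codecs.register_error("utf8b-sketch", _utf8b_errors)


def utf8b_decode(data: bytes) -> str:
    return data.decode("utf-8", "utf8b-sketch")


original = "a\udc80\ud800\u00e9\udcff"
encoded = utf8b_encode(original)
decoded = utf8b_decode(encoded)
```

For this example string, decoded compares equal to original. The sketch
only checks this one input; it does not show that the round trip holds
for every possible string.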