[Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

Glenn Linderman v+python at g.nevcal.com
Tue Apr 28 08:52:48 CEST 2009

On approximately 4/27/2009 7:11 PM, came the following characters from 
the keyboard of Cameron Simpson:
> On 27Apr2009 18:15, Glenn Linderman <v+python at g.nevcal.com> wrote:
>>>>>> The problem with this, and other preceding schemes that have been
>>>>>> discussed here, is that there is no means of ascertaining whether a
>>>>>> particular file name str was obtained from a str API, or was funny-
>>>>>> decoded from a bytes API... and thus, there is no means of reliably
>>>>>> ascertaining whether a particular filename str should be passed to a
>>>>>> str API, or funny-encoded back to bytes.
>>>>> Why is it necessary that you are able to make this distinction?
>>>> It is necessary that programs (not me) can make the distinction, so 
>>>> that they know whether or not to do the funny-encoding.
>>> I would say this isn't so. It's important that programs know if they're
>>> dealing with strings-for-filenames, but not that they be able to figure
>>> that out "a priori" if handed a bare string (especially since they
>>> can't:-)
>> So you agree they can't... that there are data puns.   (OK, you may not  
>> have thought that through)
> I agree you can't examine a string and know if it came from the os.* munging
> or from someone else's munging.
> I totally disagree that this is a problem.
> There may be puns. So what? Use the right strings for the right purpose
> and all will be well.
> I think what is missing here, and missing from Martin's PEP, is a set of
> utility functions for the os.* namespace.
> PROPOSAL: add to the PEP the following functions:
>   os.fsdecode(bytes) -> funny-encoded Unicode
>     This is what os.listdir() does to produce the strings it hands out.
>   os.fsencode(funny-string) -> bytes
>     This is what open(filename,..) does to turn the filename into bytes
>     for the POSIX open.
>   os.pathencode(your-string) -> funny-encoded-Unicode
>     This is what you must do to a de novo string to turn it into a
>     string suitable for use by open.
>     Importantly, for most strings not hand crafted to have weird
>     sequences in them, it is a no-op. But it will recode your puns
>     for survival.
> and for me, I would like to see:
>   os.setfilesystemencoding(coding)
> Currently os.getfilesystemencoding() returns you the encoding based on
> the current locale, and (I trust) the os.* stuff encodes on that basis.
> setfilesystemencoding() would override that, unless coding==None, in
> which case it reverts to the former "use the user's current locale" behaviour.
> (We have locale "C" for what one might otherwise expect None to mean:-)
> The idea here is to let the program control the codec used for filenames
> for special purposes, without working indirectly through the locale.
>>>> If a name is  funny-decoded when the name is accessed by a directory 
>>>> listing, it needs  to be funny-encoded in order to open the file.
>>> Hmm. I had thought that legitimate unicode strings already get transcoded
>>> to bytes via the mapping specified by sys.getfilesystemencoding()
>>> (the user's locale). That already happens I believe, and Martin's
>>> scheme doesn't change this. He's just funny-encoding non-decodable byte
>>> sequences, not the decoded stuff that surrounds them.
>> So assume a non-decodable sequence in a name.  That puts us into  
>> Martin's funny-decode scheme.  His funny-decode scheme produces a bare  
>> string, indistinguishable from a bare string that would be produced by a  
>> str API that happens to contain that same sequence.  Data puns.
> See my proposal above. Does it address your concerns? A program still
> must know the provenance of the string, and _if_ you're working with
> non-decodable sequences in names then you should transmute them into
> the funny encoding using the os.pathencode() function described above.
> In this way the punning issue can be avoided.
> _Lacking_ such a function, your punning concern is valid.

Seems like one would also desire os.pathdecode to do the reverse.  And 
also versions that take or produce bytes from funny-encoded strings.

Then, if programs were re-coded to perform these transformations on what 
you call de novo strings, then the scheme would work.
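For concreteness, Cameron's proposed os.fsdecode/os.fsencode pair can be sketched with the error handler PEP 383 describes (here Python's "surrogateescape" handler; the function signatures and the explicit "utf-8" default are my own illustrative choices, not the PEP's text).  Each undecodable byte becomes a lone surrogate in U+DC80..U+DCFF, and the mapping reverses losslessly:

```python
def fsdecode(raw: bytes, encoding: str = "utf-8") -> str:
    # bytes -> funny-encoded str: undecodable bytes become lone
    # surrogates in U+DC80..U+DCFF (the "funny-decoding").
    return raw.decode(encoding, "surrogateescape")

def fsencode(name: str, encoding: str = "utf-8") -> bytes:
    # funny-encoded str -> bytes: lone surrogates turn back into
    # the original undecodable bytes.
    return name.encode(encoding, "surrogateescape")

raw = b"caf\xe9"                      # 0xE9 alone is not valid UTF-8
name = fsdecode(raw)
print(repr(name))                     # 'caf\udce9'
assert fsencode(name) == raw          # the round trip is lossless
```

Note the data pun this thread worries about: a de novo str that already contains "\udce9" is indistinguishable from the funny-decoded one above.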

But I think a large part of the incentive for the PEP is to try to 
invent a scheme that intentionally allows for the puns, so that programs 
do not need to be recoded in this manner, and yet still work.  I don't 
think such a scheme exists.

If there is going to be a required transformation from de novo strings 
to funny-encoded strings, then why not make one that people can actually 
see and compare and decode from the displayable form, by using 
displayable characters instead of lone surrogates?
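As one hypothetical illustration of such a visible encoding (the choice of "%" as escape character and the helper names are entirely my own, not anything proposed in the PEP): escape undecodable bytes as %XX and escape the escape character itself, so the mapping stays 1-to-1 and no data puns arise:

```python
def visible_decode(raw: bytes, encoding: str = "utf-8") -> str:
    # Decode bytes to a displayable str: undecodable bytes become
    # "%XX", and a literal "%" becomes "%25" so every "%" in the
    # output unambiguously starts an escape of exactly one byte.
    s = raw.decode(encoding, "surrogateescape")
    out = []
    for ch in s:
        if ch == "%":
            out.append("%25")
        elif "\udc80" <= ch <= "\udcff":      # escaped undecodable byte
            out.append("%%%02X" % (ord(ch) - 0xDC00))
        else:
            out.append(ch)
    return "".join(out)

def visible_encode(s: str, encoding: str = "utf-8") -> bytes:
    # Reverse the mapping exactly: "%XX" -> one byte, all else encodes.
    out = bytearray()
    i = 0
    while i < len(s):
        if s[i] == "%":
            out.append(int(s[i + 1:i + 3], 16))
            i += 3
        else:
            out += s[i].encode(encoding)
            i += 1
    return bytes(out)

raw = b"caf\xe9 100%"
shown = visible_decode(raw)
print(shown)                          # caf%E9 100%25  (all visible)
assert visible_encode(shown) == raw
```

The cost, of course, is that ordinary names containing "%" are no longer no-ops, which is exactly the trade-off between visibility and transparency being debated here.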

>> So when open is handed the string, should it open the file with the name  
>> that matches the string, or the file with the name that funny-decodes to  
>> the same string?  It can't know, unless it knows that the string is a  
>> funny-decoded string or not.
> True. open() should always expect a funny-encoded name.
>>> So it is already the case that strings get decoded to bytes by
>>> calls like open(). Martin isn't changing that.
>> I thought the process of converting strings to bytes is called encoding.  
>> You seem to be calling it decoding?
> My head must be standing in the wrong place. Yes, I probably mean
> encoding here. I'm trying to accompany these terms with little pictures
> like "string->bytes" to avoid confusion.
>>> I suppose if your program carefully constructs a unicode string riddled
>>> with half-surrogates etc and imagines something specific should happen
>>> to them on the way to being POSIX bytes then you might have a problem...
>> Right.  Or someone else's program does that.  I only want to use Unicode  
>> file names.  But if those other file names exist, I want to be able to  
>> access them, and not accidentally get a different file.
> Point taken. And I think addressed by the utility function proposed
> above.
> [...snip normal versus odd chars for the funny-encoding ...]
>>> Also, by avoiding reuse of legitimate characters in the encoding we can
>>> avoid your issue with losing track of where a string came from;
>>> legitimate characters are currently untouched by Martin's scheme, except
>>> for the normal "bytes<->string via the user's locale" translation that
>>> must already happen, and there you're aided by byets and strings being
>>> different types.
>> There are abnormal characters, but there are no illegal characters.   
> I thought half-surrogates were illegal in well formed Unicode. I confess
> to being weak in this area. By "legitimate" above I meant things like
> half-surrogates which, like quarks, should not occur alone?

"Illegal" just means violating the accepted rules.  In this case, the 
accepted rules are those enforced by the file system (at the bytes or 
str API levels), and by Python (for the str manipulations).  None of 
those rules outlaw lone surrogates.  Hence, while all of the systems 
under discussion can handle all Unicode characters in one way or 
another, none of them require that all Unicode rules are followed.  Yes, 
you are correct that lone surrogates are illegal in Unicode.  No, none 
of the accepted rules for these systems require Unicode.
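This is easy to check directly in Python 3 (a small demonstration of my own, not from the thread): a lone surrogate is rejected by strict UTF-8, yet it is a perfectly legal element of a str for all of Python's string manipulations:

```python
s = "abc\udc80def"            # lone surrogate: ill-formed Unicode,
                              # but a legal Python str
assert len(s) == 7            # ordinary indexing and slicing work
assert s[3] == "\udc80"

try:
    s.encode("utf-8")         # strict UTF-8 refuses lone surrogates
except UnicodeEncodeError:
    pass
else:
    raise AssertionError("expected UnicodeEncodeError")

# the PEP's error handler maps it back to the raw byte 0x80
assert s.encode("utf-8", "surrogateescape") == b"abc\x80def"
```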

>> NTFS permits any 16-bit "character" code, including abnormal ones,  
>> including half-surrogates, and including full surrogate sequences that  
>> decode to PUA characters.  POSIX permits all byte sequences, including  
>> things that look like UTF-8, things that don't look like UTF-8, things  
>> that look like half-surrogates, and things that look like full surrogate  
>> sequences that decode to PUA characters.
> Sure. I'm not really talking about what filesystem will accept at
> the native layer, I was talking in the python funny-encoded space.
> [..."escaping is necessary"... I agree...]
>>>> I'm certainly not experienced enough in Python development processes 
>>>> or  internals to attempt such, as yet.  But somewhere in 25 years of  
>>>> programming, I picked up the knowledge that if you want to have a 
>>>> 1-to-1  reversible mapping, you have to avoid data puns, mappings of 
>>>> two  different data values into a single data value.  Your PEP, as 
>>>> first  written, didn't seem to do that... since there are two 
>>>> interfaces from  which to obtain data values, one performing a 
>>>> mapping from bytes to  "funny invalid" Unicode, and the other 
>>>> performing no mapping, but  accepting any sort of Unicode, possibly 
>>>> including "funny invalid"  Unicode, the possibility of data puns 
>>>> seems to exist.  I may be  misunderstanding something about the use 
>>>> cases that prevent these two  sources of "funny invalid" Unicode from 
>>>> ever coexisting, but if so,  perhaps you could point it out, or 
>>>> clarify the PEP.
>>> Please elucidate the "second source" of strings. I'm presuming you mean
>>> strings generated from scratch rather than obtained by something like
>>> listdir().
>> POSIX has byte APIs for strings, that's one source, that is most under  
>> discussion.  Windows has both bytes and 16-bit APIs for strings... the  
>> 16-bit APIs are generally mapped directly to UTF-16, but are not checked  
>> for UTF-16 validity, so all of Martin's funny-decoded files could be  
>> used for Windows file names on the 16-bit APIs.
> These are existing file objects, I'll take them as source 1. They get
> encoded for release by os.listdir() et al.
>> And yes, strings can be  
>> generated from scratch.
> I take this to be source 2.

One variation of source 2 is reading output from other programs, such as 
ls (POSIX) or dir (Windows).

> I think I agree with all the discussion that followed, and think the
> real problem is lack of utility functions to funny-encode source 2
> strings for use. Hence the proposal above.

I think we understand each other now.  I think your proposal could work, 
Cameron, although when recoding applications to use your proposal, I'd 
find it easier to use the "file name object" that others have proposed.  
I think that because both your proposal and the object proposals 
require recoding the application, they will not be accepted.  I 
think that because PEP 383 allows data puns, it should not be 
accepted in its present form.

I think that if your proposal is accepted, it then becomes possible 
to use an encoding that uses visible characters, which makes it easier 
for people to understand and verify.  An encoding such as the one I 
suggested, but perhaps using a more obscure character, if there is one 
that doesn't violate true Unicode.  I think it should transform all 
data, from str and bytes interfaces, and produce only str values 
containing conforming Unicode, escaping all the non-conforming sequences 
in some manner.  This would make the strings truly readable, as long as 
fonts for all the characters are available.  And I had already suggested 
the utility functions you are suggesting, actually, in my first tirade 
against PEP 383 (search for "The encode and decode functions should be 
available for coders to use, that code to external
interfaces, either OS or 3rd party packages, that do not use this 
encoding scheme").  I really don't care who gets the credit for the 
idea; others may have suggested it before me.  But I do care that the 
solution should provide functionality that works without 
ambiguity/data puns.

The solution that was proposed in the lead up to releasing Python 3.0 
was to offer both bytes and str interfaces (so we have those), and then 
for those that want to have a single portable implementation that can 
access all data, an object that encapsulates the differences, and the 
variant system APIs.  (file system is one, command line is another, 
environment is another, I'm not sure if there are more.)  I haven't 
heard if any progress on such an encapsulating object has been made; the 
people that proposed such have been rather quiet about this PEP.  I 
would expect that an object implementation would provide display 
strings, and APIs to submit de novo str and bytes values to an object, 
which would run the appropriate encoding on them.
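As a rough illustration of what such an encapsulating object might look like (the class name, the methods, and the use of "surrogateescape" are my own guesses at a design, not anything anyone has committed to): it keeps the platform-native form, accepts de novo str values, and derives a display form that is never fed back to the filesystem:

```python
class Filename:
    """Hypothetical wrapper that keeps the platform-native form
    (bytes on POSIX) and derives the other forms on demand."""

    def __init__(self, raw: bytes):
        self._raw = raw                     # exact bytes from listdir()

    @classmethod
    def from_str(cls, s: str, encoding: str = "utf-8") -> "Filename":
        # accept a de novo str and run the appropriate encoding on it
        return cls(s.encode(encoding, "surrogateescape"))

    def as_bytes(self) -> bytes:
        # exact form to hand to the byte-level OS interfaces
        return self._raw

    def display(self, encoding: str = "utf-8") -> str:
        # lossy, human-readable form -- for display only, never
        # passed back to open()
        return self._raw.decode(encoding, "replace")

f = Filename(b"caf\xe9")
assert f.as_bytes() == b"caf\xe9"
assert f.display() == "caf\ufffd"     # undecodable byte shown visibly
assert Filename.from_str("caf\xe9").as_bytes() == b"caf\xc3\xa9"
```

Because the display form is one-way, it can use whatever visible convention is clearest without creating data puns.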

Programs that want to use str interfaces on POSIX will see a subset of 
files on systems that contain files whose byte filenames are not 
decodable.  If a sysadmin wants to standardize on UTF-8 names 
universally, they can use something like convmv to clean up existing 
file names that don't conform.  Programs that use str interfaces on 
POSIX systems will work fine, but only with that subset of files.  When 
that is unacceptable, they can either be recoded to use the bytes interfaces, 
or the hopefully forthcoming object encapsulation.  The issue then will 
be what technique will be used to transform bytes into display names, 
but since the display names would never be fed back to the objects 
directly (but the object would have an interface to accept de novo str 
and de novo bytes) then it is just a display issue, and one that uses 
visible characters would seem more useful in my mind, than one that uses 
half-surrogates or PUAs.

Glenn -- http://nevcal.com/
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking