[Python-Dev] a suggestion ... Re: PEP 383 (again)

Tue Apr 28 21:50:55 CEST 2009

On Apr 28, 2009, at 13:01 PM, Thomas Breuel wrote:

> (2) Should the default UTF-8 encoder for file system operations be  
> allowed to generate illegal byte sequences?
>
> I think that's a definite no; if I set the encoding for a device to  
> UTF-8, I never want Python to try to write illegal UTF-8 strings to  
> my device.
...
> If people really want the option of (3c), then I think encoders  
> related to the file system should by default reject those strings  
> as illegal because the potential problems from writing them are  
> just too serious.  Printing routines and UI routines could display  
> them without error (but some clear indication), of course.

For what it is worth, sometimes we have to write bytes to a POSIX
filesystem even though those bytes are not the encoding of any string
in the filesystem's "alleged encoding".  The reason is that filenames
which do not decode in that encoding are common, and the user expects
my tool (Tahoe-LAFS [1]) to copy such a name to a distributed storage
grid and then copy it back unchanged, even though, I reiterate, that
name is *not* a valid encoding of anything in the current encoding.

This doesn't argue that writing such bytes has to be the *default*
behavior, but it is sometimes necessary.
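
To make the round-trip requirement concrete, here is a minimal sketch.
It assumes the "surrogateescape" error handler that PEP 383 proposes
(the spelling below is the one that later shipped in Python 3.1), and
the filename is a made-up example, not one taken from Tahoe-LAFS.  It
shows a name that is not valid UTF-8 surviving a decode/encode cycle
byte for byte, while a strict encoder refuses to write it:

```python
# Hypothetical filename: Latin-1 bytes for "café.txt", which are NOT valid UTF-8.
raw_name = b"caf\xe9.txt"

# A strict UTF-8 decoder rejects the name outright.
try:
    raw_name.decode("utf-8")
    strict_decode_ok = True
except UnicodeDecodeError:
    strict_decode_ok = False

# With surrogateescape, each undecodable byte 0xNN becomes the lone
# surrogate U+DCNN, so the name can be carried around as a str.
text_name = raw_name.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly,
# which is what "copy it back unchanged" requires.
round_tripped = text_name.encode("utf-8", errors="surrogateescape")

# A strict encoder (the "reject by default" position) refuses the
# escaped string instead of emitting illegal UTF-8.
try:
    text_name.encode("utf-8")
    strict_encode_ok = True
except UnicodeEncodeError:
    strict_encode_ok = False

print(round_tripped == raw_name, strict_decode_ok, strict_encode_ok)
```

So both positions can coexist: strict by default, with the escaping
handler as an explicit opt-in for tools that must preserve names.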

It's too bad that POSIX is so far behind Mac OS X in this respect.   
(Also so far behind Windows, but I use Mac as the example to show how  
it is possible to build a better system on top of POSIX.)  Hopefully  
David Wheeler's proposals to tighten the requirements in Linux  
filesystems will catch on: [2].

Regards,

Zooko

[1] http://allmydata.org
[2] http://www.dwheeler.com/essays/fixing-unix-linux-filenames.html