
On 22Apr2009 08:50, Martin v. Löwis <martin@v.loewis.de> wrote:
| File names, environment variables, and command line arguments are
| defined as being character data in POSIX;

Specific citation please? I'd like to check the specifics of this.

| the C APIs however allow
| passing arbitrary bytes - whether these conform to a certain encoding
| or not.

Indeed.

| This PEP proposes a means of dealing with such irregularities
| by embedding the bytes in character strings in such a way that allows
| recreation of the original byte string.
[...]

So you're proposing that all POSIX OS interfaces (which use byte strings)
interpret those byte strings into Python3 str objects, with a codec that
will accept arbitrary byte sequences losslessly and is totally reversible,
yes?

And, I hope, that the os.* interfaces silently use it by default.

| For most applications, we assume that they eventually pass data
| received from a system interface back into the same system
| interfaces. For example, and application invoking os.listdir() will
| likely pass the result strings back into APIs like os.stat() or
| open(), which then encodes them back into their original byte
| representation. Applications that need to process the original byte
| strings can obtain them by encoding the character strings with the
| file system encoding, passing "python-escape" as the error handler
| name.

-1

This last sentence kills the idea for me, unless I'm missing something.
Which I may be, of course.

POSIX filesystems _do_not_ have a file system encoding.

The user's environment suggests a preferred encoding via the locale
stuff, and apps honouring that will make nice looking byte strings as
filenames for that user. (Some platforms, like MacOSX' HFS filesystems,
_do_ enforce an encoding, and a quite specific variety of UTF-8 it is;
I would say they're not a full UNIX filesystem _precisely_ because they
reject certain byte strings that are valid on other UNIX filesystems.
What will your proposal do here? I can imagine it might cope with
existing names, but what happens when the user creates a new name?)

Further, different users can use different locales and encodings.
If they do it in different work areas they'll be perfectly happy;
if they do it in a shared area doubtless confusion will reign,
but only in the users' minds, not in the filesystem.

If I'm writing a general purpose UNIX tool like chmod or find, I expect
it to work reliably on _any_ UNIX pathname. It must be totally encoding
blind. If I speak to the os.* interface to open a file, I expect to hand
it bytes and have it behave. As an explicit example, I would be just fine
with python's open(filename, "w") to take a string and encode it for use,
but _not_ ok for os.open() to require me to supply a string and cross
my fingers and hope something sane happens when it is turned into bytes
for the UNIX system call.

I'm very much in favour of being able to work in strings for most
purposes, but if I use the os.* interfaces on a UNIX system it is
necessary to be _able_ to work in bytes, because UNIX file pathnames
are bytes.

If there isn't a byte-safe os.* facility in Python3, it will simply be
unsuitable for writing low level UNIX tools. And I very much like using
Python2 for that.

Finally, I have a small python program whose whole purpose in life
is to transcode UNIX filenames before transfer to a MacOSX HFS
directory, because of HFS's enforced particular encoding. What approach
should a Python app take to transcode UNIX pathnames under your scheme?

Cheers,
-- 
Cameron Simpson <cs@zip.com.au> DoD#743
http://www.cskk.ezoshosting.com/cs/

The nice thing about standards is that you have so many to choose from;
furthermore, if you do not like any of them, you can just wait for next
year's model.   - Andrew S. Tanenbaum
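
A footnote for later readers: the handler the quoted draft calls
"python-escape" is available in Python 3.1 and later under the name
"surrogateescape", and os.fsencode()/os.fsdecode() (3.2 and later) apply
it with the filesystem encoding. Below is a minimal sketch of the
lossless round trip and of the bytes-in, bytes-out use of os.* argued
for above. The helper names to_text, to_bytes and list_names_as_bytes
are made up for illustration, and UTF-8 is assumed as the locale's
encoding; this is a sketch, not a statement of how any particular tool
should do it.

```python
import os
import tempfile


def to_text(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode a pathname losslessly (hypothetical helper).

    Bytes that are invalid in `encoding` are mapped to lone surrogates
    in the range U+DC80..U+DCFF instead of raising an error.
    """
    return raw.decode(encoding, errors="surrogateescape")


def to_bytes(text: str, encoding: str = "utf-8") -> bytes:
    """Inverse of to_text: turn escaped surrogates back into raw bytes."""
    return text.encode(encoding, errors="surrogateescape")


def list_names_as_bytes(directory: str) -> list:
    """Ask os.listdir() for bytes by handing it a bytes path."""
    return os.listdir(os.fsencode(directory))


if __name__ == "__main__":
    # 0xe9 followed by "-" and a bare 0xff are both invalid UTF-8.
    raw = b"caf\xe9-\xff.txt"
    assert to_bytes(to_text(raw)) == raw

    # Plain ASCII name here, since some filesystems reject invalid UTF-8.
    with tempfile.TemporaryDirectory() as tmp:
        open(os.path.join(tmp, "a.txt"), "w").close()
        print(list_names_as_bytes(tmp))
```

Passing bytes to os.listdir(), os.stat() or os.open() keeps the whole
path encoding-blind; the str round trip only matters when the name has
to be shown to, or typed by, a user.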