On Thu, Oct 2, 2008 at 10:08 AM, Fred Drake <fdrake@acm.org> wrote:
On Oct 2, 2008, at 9:39 AM, Nick Coghlan wrote:
If you don't make a habit of borking your own filesystems with dodgy filenames, it runs fine.
I really hope the individuals making this argument are being facetious. I don't think this is the source of the problem at all.
I expect the most common occurrence of the problem comes from sharing of drives between operating systems and individual configurations; those ubiquitous little USB "thumb" drives get shared between all kinds of computers these days as people share files they don't want to or can't pass over a network for whatever reason. (Those drives might actually serve other purposes first, such as being music players, and so may have no other interfaces for transferring files.)
If someone hands me a USB flash drive with filenames encoded in whatever is reasonable for them, I should be able to use Python tools on the files without having to use non-Python tools to copy or rename the file first. The possibility of a conflicting encoding is increased if the source machine is configured to use a very different encoding, clearly, but that's not that unusual.
The world is smaller than it used to be, and we really need to understand that.
All good points.
However, no matter how you spin it, we're in a hard place. If we maintain that filenames should always be represented as text strings, we have no choice but to come up with a way of encoding all possible byte sequences into Unicode strings, using a reversible encoding. This has been shown to be hard no matter what encoding you favor -- as soon as those "Unicode" strings are passed on to other libraries or programs, nobody is sure how they will be treated.
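To make the difficulty concrete, here is a minimal sketch that uses latin-1 as the reversible mapping. This is only an illustration, not one of the schemes proposed in this thread, and the filename bytes are hypothetical. Latin-1 maps every byte value to exactly one code point, so the round trip is lossless, but a consumer that re-encodes the string as UTF-8 ends up with different bytes and would no longer find the file.

```python
# Hypothetical filename bytes that are not valid UTF-8.
raw_name = b"caf\xe9-\xff\xfe.txt"

# latin-1 maps each byte 0-255 to one code point, so decoding never fails
# and encoding with the same codec restores the original bytes.
as_text = raw_name.decode("latin-1")
round_tripped = as_text.encode("latin-1")

# A library that assumes UTF-8 would produce different bytes for the
# same string, which no longer name the original file.
reencoded_utf8 = as_text.encode("utf-8")

# Strict UTF-8 decoding of the original bytes fails outright.
try:
    raw_name.decode("utf-8")
    utf8_decodable = True
except UnicodeDecodeError:
    utf8_decodable = False

print(round_tripped == raw_name)   # round trip preserved the bytes
print(reencoded_utf8 == raw_name)  # the UTF-8 consumer did not
print(utf8_decodable)
```

The mapping itself is trivially reversible; the trouble starts only once the string leaves code that knows which convention was used.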
If we switch to the view that all filenames are bytes after all, Windows loses, because not all filenames are representable that way (unless you deviate from the encoding that Windows has chosen for you, which presents other problems). Also, it would be a *huge* project, since filenames are so ubiquitous.
There are a number of ways out, but I don't think we'll be able to come up with a solution without doing a lot of experimentation. Therefore I believe the best thing to do is to release 3.0 with a low-level solution that makes it possible to carry out those experiments. I am hoping that Martin will check in his sys.setfilesystemencoding() function, and I am working on Victor Stinner's code for better supporting filenames-as-bytes (in addition to, not instead of, filenames-as-text), and I expect that these two together will allow the necessary experimentation to take place post-3.0.
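As a rough sketch of what filenames-as-bytes alongside filenames-as-text can look like in practice, the snippet below assumes a current Python 3 interpreter rather than the 3.0 code under discussion here. It relies on os.listdir() returning names of the same type as its argument, and on os.fsencode(), which was added later. The temporary directory and the file name are made up for the example.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Create one empty file with a plain ASCII name (illustrative only).
    open(os.path.join(tmp, "example.txt"), "w").close()

    # A str path yields str names.
    text_names = os.listdir(tmp)

    # A bytes path yields bytes names, with no decoding step that could fail.
    byte_names = os.listdir(os.fsencode(tmp))

print(text_names)
print(byte_names)
```

Code that needs to handle arbitrary names can stay in bytes end to end, while code that only wants to display names can keep using text.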
-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)