[Python-3000] [Python-Dev] Filename as byte string in python 2.6 or 3.0?

James Y Knight foom at fuhm.net
Mon Sep 29 06:43:55 CEST 2008


On Sep 28, 2008, at 7:21 PM, Gregory P. Smith wrote:

> On Sun, Sep 28, 2008 at 2:13 PM, "Martin v. Löwis"  
> <martin at v.loewis.de> wrote:
>>> "broken" systems will always exist.  Code to deal with them must be
>>> possible to write in python 3.0.
>>
>> Python 3.0 will have bugs. This might just be one of them. I can  
>> agree
>> that Python 3.x will need to support that somehow, but perhaps not  
>> 3.0.
>>
>> Regards,
>> Martin
>
> Agreed.  At this point I think we just need to get 3.0 out there and
> be willing to fix flaws like this for 3.1 or in some cases for 3.0.1.

This problem sure would be "practically" solved simply by switching  
the way the filesystemencoding is selected. You'll note that if you  
want things to Just Work for a backup tool with today's Py3k, all you  
need to do is switch the filesystem encoding to iso-8859-1. In that  
encoding, every byte string has an associated unique unicode string,  
so there's no problem with any possible filename.

With that in mind, here's my proposal:
a) Whenever ASCII would be selected as a filesystem encoding, use  
iso-8859-1 instead.
a) Whenever UTF-8 would be selected as a filesystem encoding, use  
UTF-8b [1] instead.

It's clearly not a 100% perfect solution, but it completely solves the  
issue for users with the most popular filesystem encodings: ASCII,  
iso-8859-1, and UTF-8. IMO, that's good enough to just leave things  
there.

But even if it's deemed not good enough, and the byte-string level  
file access APIs are all implemented, I *still* think doing the above  
is a good idea. It makes unicode string file/environment/argv access  
work in a huge majority of cases: a) windows always, b) Mac OS X  
always, c) ASCII locale always, d) ISO-8859-1 locale always, e) UTF-8  
locale always, f) other locales when the filenames really are encoded  
in their locale.

It will make users happy, and it's simple enough to implement for  
python 3.0.


James

[1] UTF-8b has a similar property to 8859-1, in that all byte strings  
can be successfully round-tripped. It's not currently implemented in  
python core, but it's a pretty trivial encoding, and is available  
under the BSD license, see below.

Background:
http://mail.nl.linux.org/linux-utf8/2000-07/msg00040.html

Blog post:
http://bsittler.livejournal.com/10381.html

Implementation for python:
http://hyperreal.org/~est/libutf8b/

James


More information about the Python-3000 mailing list