[Python-3000] [Python-Dev] Filename as byte string in python 2.6 or 3.0?
James Y Knight
foom at fuhm.net
Mon Sep 29 06:43:55 CEST 2008
On Sep 28, 2008, at 7:21 PM, Gregory P. Smith wrote:
> On Sun, Sep 28, 2008 at 2:13 PM, "Martin v. Löwis"
> <martin at v.loewis.de> wrote:
>>> "broken" systems will always exist. Code to deal with them must be
>>> possible to write in python 3.0.
>>
>> Python 3.0 will have bugs. This might just be one of them. I can
>> agree
>> that Python 3.x will need to support that somehow, but perhaps not
>> 3.0.
>>
>> Regards,
>> Martin
>
> Agreed. At this point I think we just need to get 3.0 out there and
> be willing to fix flaws like this for 3.1 or in some cases for 3.0.1.
This problem sure would be "practically" solved simply by switching
the way the filesystemencoding is selected. You'll note that if you
want things to Just Work for a backup tool with today's Py3k, all you
need to do is switch the filesystem encoding to iso-8859-1. In that
encoding, every byte string has an associated unique unicode string,
so there's no problem with any possible filename.
With that in mind, here's my proposal:
a) Whenever ASCII would be selected as a filesystem encoding, use
iso-8859-1 instead.
a) Whenever UTF-8 would be selected as a filesystem encoding, use
UTF-8b [1] instead.
It's clearly not a 100% perfect solution, but it completely solves the
issue for users with the most popular filesystem encodings: ASCII,
iso-8859-1, and UTF-8. IMO, that's good enough to just leave things
there.
But even if it's deemed not good enough, and the byte-string level
file access APIs are all implemented, I *still* think doing the above
is a good idea. It makes unicode string file/environment/argv access
work in a huge majority of cases: a) windows always, b) Mac OS X
always, c) ASCII locale always, d) ISO-8859-1 locale always, e) UTF-8
locale always, f) other locales when the filenames really are encoded
in their locale.
It will make users happy, and it's simple enough to implement for
python 3.0.
James
[1] UTF-8b has a similar property to 8859-1, in that all byte strings
can be successfully round-tripped. It's not currently implemented in
python core, but it's a pretty trivial encoding, and is available
under the BSD license, see below.
Background:
http://mail.nl.linux.org/linux-utf8/2000-07/msg00040.html
Blog post:
http://bsittler.livejournal.com/10381.html
Implementation for python:
http://hyperreal.org/~est/libutf8b/
James
More information about the Python-3000
mailing list