[Python-Dev] [Python-3000] New proposition for Python3 bytes filename issue

Tue Sep 30 22:04:42 CEST 2008

Guido van Rossum wrote:
> On Mon, Sep 29, 2008 at 11:00 PM, "Martin v. Löwis" <martin at v.loewis.de> wrote:
>>> Change the default file system encoding to store bytes in Unicode is like
>>> introducing a new Python type: <fake Unicode for filename hacks>.
>> Exactly. Seems like the best solution to me, despite your polemics.
> 
> Martin, I don't understand why you are in favor of storing raw bytes
> encoded as Latin-1 in Unicode string objects, which clearly gives rise
> to mojibake. In the past you have always been staunchly opposed to API
> changes or practices that could lead to mojibake (and you had me quite
> convinced).

True. I try to weigh the need for simplicity in the API against the
need to support all cases. So I see two solutions:

a) support bytes as file names. Supports all cases, but complicates
   the API very much, by pervasively bringing bytes into the status
   of a character data type. IMO, this must be prevented at all costs.

b) make character (Unicode) strings the only string type. Does not
   immediately support all cases, so some hacks are needed. However,
   even with the hacks, it preserves the simplicity of the API; the
   hacks then should ideally be limited to the applications that need
   it. On this side, I see the following approaches:
   1. try to automatically embed non-representable characters into
      the Unicode strings, e.g. by using PUA characters. Reduces
      the amount of mojibake, but produces a lot of difficult issues.
   2. let applications that so desire access all file names in a
      uniform manner, at the cost of producing tons of mojibake.
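
To make the difference between b1 and b2 concrete, here is a rough
sketch. It assumes UTF-8 as the file system encoding and uses U+F700
and up as the PUA escape range; both choices, and all the function
names, are made up for illustration and are not a proposal.

```python
import codecs

PUA_BASE = 0xF700  # hypothetical escape range, chosen only for illustration


def _pua_escape(exc):
    # Map each undecodable byte to one Private Use Area character.
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    return "".join(chr(PUA_BASE + b) for b in bad), exc.end


codecs.register_error("pua-escape-sketch", _pua_escape)


def decode_b1(raw, encoding="utf-8"):
    """Approach b1: legible where possible, PUA characters elsewhere."""
    return raw.decode(encoding, "pua-escape-sketch")


def encode_b1(name, encoding="utf-8"):
    """Reverse decode_b1 by turning PUA characters back into raw bytes."""
    out = bytearray()
    for ch in name:
        code = ord(ch)
        if PUA_BASE <= code <= PUA_BASE + 0xFF:
            out.append(code - PUA_BASE)
        else:
            out += ch.encode(encoding)
    return bytes(out)


def decode_b2(raw):
    """Approach b2: every byte becomes one character, so this never fails."""
    return raw.decode("latin-1")


def encode_b2(name):
    """Reverse decode_b2; exact for any name that decode_b2 produced."""
    return name.encode("latin-1")


latin1_name = b"caf\xe9.txt"               # Latin-1 bytes, invalid as UTF-8
utf8_name = "caf\u00e9.txt".encode("utf-8")

print(ascii(decode_b1(latin1_name)))       # one PUA character for the bad byte
print(ascii(decode_b2(utf8_name)))         # mojibake, but always reversible
```

Both variants round-trip the original bytes. The sketch also shows one
of the difficult issues with b1: a file name that legitimately contains
a character from the escape range can no longer be told apart from an
escaped byte, so encode_b1 would write the wrong bytes for it.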

In this case, I think mojibake is unavoidable: it is just a plain
flaw in the POSIX implementations (not the API or specification) that
you can run into file names where you can't come up with the right
rendering. Even for solution a), the resulting data cannot
be displayed "correctly" in all cases.
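
A small illustration of why no rendering can be "correct" in general:
the same bytes decode to different text under different encodings, and
the bytes themselves do not record which one was meant. The byte string
and the list of candidate encodings below are arbitrary examples.

```python
def candidate_renderings(raw, encodings):
    """Decode raw with each encoding; None marks a failed decode."""
    results = {}
    for enc in encodings:
        try:
            results[enc] = raw.decode(enc)
        except UnicodeDecodeError:
            results[enc] = None
    return results


# Hypothetical file name bytes; the creator's encoding is unknown.
raw = b"\xc4\xcf\xcb"
renderings = candidate_renderings(raw, ["utf-8", "latin-1", "koi8_r"])

for enc, text in renderings.items():
    print(enc, ascii(text))
```

Here the strict UTF-8 decode fails, while the Latin-1 and KOI8-R decodes
both succeed with different results, so any display is a guess.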

Currently, I favor b2, but haven't given up on b1, and they don't
exclude each other. b2 is simple to implement, and delegates the
choice between legible file names and universal access to all files
to the application. Given the way Unix works, this is the most sensible
choice, IMO: by default, Python should try to make file names legible,
but stuff like backup applications should also be implementable,
and they don't need legible file names.

I think option a) will haunt us forever. People will ask for more and
more features in the bytes type, eventually asking "give us Python
2.x strings back". It has already started: see #3982, where Benjamin
asks to have .format added to bytes (for a reason unrelated to file
names).

Regards,
Martin