[Python-Dev] [Python-3000] New proposition for Python3 bytes filename issue

Wed Oct 1 07:19:47 CEST 2008

On 03:32 am, foom at fuhm.net wrote:
>On Sep 30, 2008, at 10:06 PM, glyph at divmod.com wrote:

>Can you clarify what proposal you are supporting for Python:

Sure.  Neither of your descriptions is terribly accurate, but I'll try 
to explain.
>1) Two sets of APIs, one returning unicode strings, and one returning 
>bytestrings. (subpoints: what does the unicode-returning API do when 
>it cannot decode the bytestring into unicode? raise exception, pretend 
>argument/envvar/file didn't exist/?)

The only API discussed so far which would actually provide two variants 
is 'getcwd', which would have a 'getcwdb' that gives back bytes instead.
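
As a rough illustration of the two variants, here is how the pair reads in Python 3 as it eventually shipped, where os.getcwdb() exists alongside os.getcwd(). The variable names are only for illustration.

```python
import os

# Text variant: the current directory as a str.
cwd_text = os.getcwd()

# Bytes variant: the same directory as a bytes object.
cwd_bytes = os.getcwdb()

print(type(cwd_text).__name__, type(cwd_bytes).__name__)
```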

Pretty much every other API takes some kind of input.  listdir(bytes) 
would give back bytes, while listdir(text) would give back text. 
listdir(text) would skip undecodable filenames.

Similarly for all the other APIs in os and os.path that take pathnames 
for input.
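
A minimal sketch of the listdir behaviour described above, assuming a strict decode with the filesystem encoding decides what counts as "undecodable". The names decodable_names and listdir_sketch are hypothetical helpers, not part of the os module, and this is not necessarily how the os module itself would implement it.

```python
import os
import sys


def decodable_names(raw_names, encoding=None):
    """Return the text form of each name that decodes cleanly; skip the rest."""
    if encoding is None:
        # Assumption: the filesystem encoding is the one that matters here.
        encoding = sys.getfilesystemencoding()
    names = []
    for raw in raw_names:
        try:
            names.append(raw.decode(encoding, "strict"))
        except UnicodeDecodeError:
            # Lossy by design: an undecodable name is silently dropped.
            continue
    return names


def listdir_sketch(path):
    """Hypothetical wrapper: bytes in, bytes out; text in, text out."""
    if isinstance(path, bytes):
        return os.listdir(path)
    return decodable_names(os.listdir(os.fsencode(path)))


# The third name is not valid UTF-8, so the text view leaves it out.
sample = [b"plain.txt", b"caf\xc3\xa9.txt", b"bad\xff.txt"]
print(decodable_names(sample, "utf-8"))
```

A caller that needs every entry, decodable or not, passes bytes and gets bytes back.
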
>2) All APIs return bytestrings only. Converting to unicode is 
>considered lossy, and would have to be done by applications for 
>display purposes only.

This is a bad way to do things, because on Windows, filenames *really 
are* unicode.  Converting to bytes is what's lossy.  (See previous 
discussion of active codepages and CreateFileA/CreateFileW.)
>I really don't understand the reasoning for (1).

The reasoning is that a lot of software doesn't care if it's wrong for 
edge cases, and it's really hard to come up with something that's 
correct with respect to all of those edge cases (absurdly difficult, if 
you need to stay in the straitjacket of string / bytes types, as well as 
provide a useful library interface - which is why we're having this 
discussion).  But it should be _possible_ to write software that's 
correct in the face of those edge cases.

And - let's not forget this - the worlds of POSIX and Windows really are 
different and really do require subtly different inputs.  Python can try 
to paper over this like Java does and make it impossible to write 
certain classes of application, or it can just provide an ugly, slightly 
inconsistent API that exposes the ugly, slightly inconsistent reality. 
Modulo the issues you've raised, which I don't think the proposal totally 
covers yet (abspath with a non-decodable cwd), I think it strikes a nice 
balance: allow people to live in the delusion of unicode-on-POSIX and 
have software that mostly works, most of the time, or allow them to face 
the unpleasantness and spend the effort to get something really solid.

I think the _right_ answer to all of this is to (A) make FilePath work 
completely correctly for every totally insane edge case ever, and (B) 
include it in the stdlib.  One day I think we'll do that.  But nobody 
has the time or energy to do even the first part of that *right now*, 
before 3.0 is released, so I'm just looking for something on top of 
which it will be possible to build FilePath, or something like it, 
without too badly breaking other people's applications that rely on the 
os module directly.
>It seems to me that most software (probably including all of the 
>Python stdlib) would continue to use the unicode string API.

That's true.  And that software wouldn't handle these edge cases 
completely correctly.  As Guido put it, "it's a quality of 
implementation issue".
>Switching all of the Python stdlib to use the bytestring APIs instead 
>would certainly be a large undertaking, and would have all sorts of 
>ripple-on API changes (e.g. __file__).

I am not quite sure what to do about __file__.  My preference would 
probably be to use unicode filename for consistency so it can always be 
displayed, but provide a second attribute (__open_file__?) that would be 
sometimes unicode, sometimes bytes, which would be guaranteed to work 
with open().  I suspect that most software which interacts with __file__ 
on a deep level would be of the variety which would deal with the edge 
cases.

But where the Python stdlib wants a pathname it should be accepting 
either bytes or unicode, as all of the os.path functions want.  This 
does kind of suck, but the alternatives are to encode crazy extra 
information in unicode path names that cannot be exchanged with other 
programs (or with users: NULL is potentially the worst bogus character 
from a UI perspective), or revert to bytes for everything (which is a 
non-solution, cf. Windows above).
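
To make "accepting either bytes or unicode" concrete, here is a small sketch of a function that returns whichever type it was given, the way the os.path functions do. The helper backup_name is hypothetical and assumes an ASCII-only suffix.

```python
import os.path


def backup_name(path):
    """Hypothetical helper: append '.bak', keeping the caller's type."""
    if isinstance(path, bytes):
        # Bytes in, bytes out: never decode the caller's path.
        return path + b".bak"
    return path + ".bak"


# os.path.join follows the same rule: the result type matches the input type.
print(backup_name(os.path.join("data", "report.txt")))
print(backup_name(os.path.join(b"data", b"report.txt")))
```
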
>So I can only imagine that if you're proposing (1), you're doing so 
>without the intention of suggesting that Python be converted to use 
>it.

Maybe updating the stdlib to be correct in the face of such changes is 
hard, but it doesn't seem intractable.  It looks like there are only 
about 100 calls in the stdlib to getcwd and abspath combined, and I 
suspect many of them are for purely aesthetic purposes and could just be 
eliminated, and many of them are redefinitions of the functions and 
don't need any changes.

All the other path manipulation functions would continue to work as-is, 
although some of them might skip undecodable files.
>And so, of course, that doesn't really fix things (such as getcwd 
>failing if your cwd is a path that is undecodeable in the current 
>locale, or well, currently, python refusing to even start).

The proposal as I understand it so far doesn't address this 
specifically, so I'll try to.  os.getcwd, os.path.abspath, and 
os.path.realpath (when called with unicode) will probably need to do 
something gross if they're called on a non-decodable directory.  One 
thing that comes to mind is to create a temporary symbolic link and 
return u'/tmp/python-$YOURUID-undecodable/$GUID/something'.  I hope 
someone else has a better idea, especially since that sort of defeats 
the purpose of realpath.
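
The symbolic-link strawman can be sketched roughly as follows, for POSIX only. The function name text_alias_sketch is hypothetical, and as a simplifying assumption it uses tempfile.mkdtemp for the holding directory rather than the per-UID, per-GUID layout mentioned above.

```python
import os
import tempfile


def text_alias_sketch(path_bytes, encoding="utf-8"):
    """Return a text path for path_bytes, via a symlink if it won't decode."""
    try:
        # The common case: the path decodes, so no alias is needed.
        return path_bytes.decode(encoding, "strict")
    except UnicodeDecodeError:
        pass
    # Assumption: a fresh private directory stands in for the
    # per-user, per-GUID directory in the strawman.
    holder = tempfile.mkdtemp(prefix="python-undecodable-")
    alias = os.path.join(holder, "something")
    # The link target keeps the original bytes; only the alias is text.
    os.symlink(path_bytes, os.fsencode(alias))
    return alias
```

The alias is displayable text, but it is not the real path, which is exactly the realpath objection raised above. Nothing in this sketch cleans up the links afterwards.
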

On the other hand, even this strawman answer is correct for pretty much 
any sane purpose, and if you _really_ care, you need to learn that you 
have to use and ask for bytes, on POSIX, to deal with such corner cases.
>If you're proposing (2), (...)

Luckily I'm not.
>>The proposal of using U+0000 seems like it would have been almost the 
>>same from such a wrapper's perspective, except (A) people using the 
>>filesystem APIs without the benefit of such a wrapper would have been 
>>even more screwed
>
>I'm not sure what your "more screwed" is comparing against: current 
>py3k behavior? (aka: decoding to Unicode in locale's specified 
>encoding)? I don't see how you can really be more screwed than that: 
>not only can't you send your filename to display in a Gtk+ button, you 
>can't access it at all, even staying within python.

You're screwed if you're trying to access files in a portable way 
without worrying at all about encodings.  There are files you won't be 
able to access, there are conditions you won't be able to deal with. 
Sorry, but POSIX sucks and that's life.

You're _more_ screwed if you're trying to access those files in a 
portable way without worrying about encodings, and the API you're using 
is giving you back invalid, magic path names with NULLs, rather than 
being slightly lossy and dropping filenames you (obviously, by virtue of 
the way you requested those filenames) won't be able to deal with.

So I was talking here about the default behavior in the case of a naive 
program that wants to pretend all paths are unicode.
>>and (B) there are a few nasty corner-cases when dealing with 
>>surrogate (i.e. invalid, in UTF-8) code points which I'm not quite 
>>sure what it would have done with.
>
>The lone-surrogate-pair proposal was a totally different proposal than 
>the U+0000 one.

I wasn't referring to the lone-surrogate-pair encoding trick, I was 
referring to the fact that some people are going to want to treat 
surrogate pairs as encoding errors (i.e. include the NULL byte) and some 
will want to treat them as valid.  If you want them to be valid you have 
to normalize away the surrogates in order to talk to other software, but 
you can't do that because then you'll get different bytes when you 
re-encode them.
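
The round-trip problem can be shown with a short sketch. It assumes, purely for illustration, a filename whose bytes encode the two halves of a surrogate pair separately (a CESU-8 style encoding), and uses Python's "surrogatepass" error handler to decode them.

```python
# Bytes that spell out a surrogate pair (U+D83D, U+DE00) one half at a time.
raw = b"\xed\xa0\xbd\xed\xb8\x80"

# Treating the surrogates as valid: decode them as two separate code points.
as_surrogates = raw.decode("utf-8", "surrogatepass")

# Normalizing: combine the pair into the single code point it denotes,
# which is what other software would expect to receive.
normalized = as_surrogates.encode("utf-16", "surrogatepass").decode("utf-16")

# Re-encoding the normalized text no longer yields the original bytes,
# so the original filename can't be recovered from it.
reencoded = normalized.encode("utf-8")
print(raw == reencoded)
```
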

There's probably a way around that but it would be subtle and 
controversial no matter how you did it.