On 03:32 am, foom@fuhm.net wrote:
On Sep 30, 2008, at 10:06 PM, glyph@divmod.com wrote:
Can you clarify what proposal you are supporting for Python:
Sure. Neither of your descriptions is terribly accurate, but I'll try to explain.
1) Two sets of APIs, one returning unicode strings, and one returning bytestrings. (subpoints: what does the unicode-returning API do when it cannot decode the bytestring into unicode? raise exception, pretend argument/envvar/file didn't exist/?)
The only API discussed so far which would actually provide two variants is 'getcwd', which would have a 'getcwdb' that gives back bytes instead. Pretty much every other API takes some kind of input. listdir(bytes) would give back bytes, while listdir(text) would give back text. listdir(text) would skip undecodable filenames. Similarly for all the other APIs in os and os.path that take pathnames for input.
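A rough sketch of those listdir semantics, as I read them. The names `decode_or_skip` and `proposed_listdir` are made up for illustration and are not part of any proposal or of the os module. It assumes `os.fsencode` and `sys.getfilesystemencoding` are available, and it only models the behavior described above, not what any particular Python release does.

```python
import os
import sys


def decode_or_skip(raw_names, encoding=None):
    """Return the names that decode cleanly; silently drop the rest."""
    encoding = encoding or sys.getfilesystemencoding()
    decoded = []
    for raw in raw_names:
        try:
            # Strict decoding: no error handler, so bad bytes raise.
            decoded.append(raw.decode(encoding))
        except UnicodeDecodeError:
            # The text-returning variant pretends this entry isn't there.
            continue
    return decoded


def proposed_listdir(path):
    """bytes in, bytes out; text in, text out (minus undecodable names)."""
    raw_names = os.listdir(os.fsencode(path))
    if isinstance(path, bytes):
        return raw_names
    return decode_or_skip(raw_names)
```

A caller that passes bytes sees every entry; a caller that passes text sees only the entries it could have displayed anyway.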
2) All APIs return bytestrings only. Converting to unicode is considered lossy, and would have to be done by applications for display purposes only.
This is a bad way to do things, because on Windows, filenames *really are* unicode. Converting to bytes is what's lossy. (See previous discussion of active codepages and CreateFileA/CreateFileW.)
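To make the "converting to bytes is what's lossy" point concrete, here is a small pure-Python illustration. It does not call any Windows API; it only simulates squeezing a unicode filename through a narrow codepage (cp1252 is picked as an example, the real active codepage varies) and compares that with a 16-bit encoding.

```python
# A filename with characters that the cp1252 codepage cannot represent.
name = "отчёт.txt"

# Forcing it into the narrow codepage replaces each unknown character.
lossy = name.encode("cp1252", errors="replace")

# Decoding the bytes again does not bring the original name back.
round_trip = lossy.decode("cp1252")

# A UTF-16 round trip, by contrast, preserves the name.
wide = name.encode("utf-16-le").decode("utf-16-le")

print(lossy)               # b'?????.txt'
print(round_trip == name)  # False
print(wide == name)        # True
```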
I really don't understand the reasoning for (1).
The reasoning is that a lot of software doesn't care if it's wrong for edge cases, and it's really hard to come up with something that's correct with respect to all of those edge cases (absurdly difficult, if you need to stay in the straitjacket of string / bytes types, as well as provide a useful library interface - which is why we're having this discussion). But it should be _possible_ to write software that's correct in the face of those edge cases. And - let's not forget this - the worlds of POSIX and Windows really are different and really do require subtly different inputs. Python can try to paper over this like Java does and make it impossible to write certain classes of application, or it can just provide an ugly, slightly inconsistent API that exposes the ugly, slightly inconsistent reality. Modulo the issues you've raised which I don't think the proposal totally covers yet (abspath with a non-decodable cwd), I think it strikes a nice balance: allow people to live in the delusion of unicode-on-POSIX and have software that mostly works, most of the time, or allow them to face the unpleasantness and spend the effort to get something really solid.
I think the _right_ answer to all of this is to (A) make FilePath work completely correctly for every totally insane edge case ever, and (B) include it in the stdlib. One day I think we'll do that. But nobody has the time or energy to do even the first part of that *right now*, before 3.0 is released, so I'm just looking for something on top of which it will be possible to build FilePath, or something like it, without too badly breaking the applications of other people who rely on the os module directly.
It seems to me that most software (probably including all of the Python stdlib) would continue to use the unicode string API.
That's true. And that software wouldn't handle these edge cases completely correctly. As Guido put it, "it's a quality of implementation issue".
Switching all of the Python stdlib to use the bytestring APIs instead would certainly be a large undertaking, and would have all sorts of ripple-on API changes (e.g. __file__).
I am not quite sure what to do about __file__. My preference would probably be to use a unicode filename for consistency so it can always be displayed, but provide a second attribute (__open_file__?) that would be sometimes unicode, sometimes bytes, and which would be guaranteed to work with open(). I suspect that most software which interacts with __file__ on a deep level would be of the variety which would deal with the edge cases. But where the Python stdlib wants a pathname it should be accepting either bytes or unicode, as all of the os.path functions want. This does kind of suck, but the alternatives are to encode crazy extra information in unicode path names that cannot be exchanged with other programs (or with users: NULL is potentially the worst bogus character from a UI perspective), or revert to bytes for everything (which is a non-solution, cf. Windows above).
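For what "accepting either bytes or unicode" might look like in library code, here is a minimal sketch. The helper `sibling` is hypothetical; it assumes `os.fsencode` and `os.fsdecode` exist and that os.path functions return the same type they are given, refusing to mix the two.

```python
import os


def sibling(path, name):
    """Return `name` placed next to `path`, in the same type as `path`."""
    if isinstance(path, bytes):
        # Caller is working in bytes, so keep everything in bytes.
        name = os.fsencode(name)
    else:
        # Caller is working in text, so keep everything in text.
        name = os.fsdecode(name)
    return os.path.join(os.path.dirname(path), name)


print(sibling("/tmp/a.txt", "b.txt"))    # a text path
print(sibling(b"/tmp/a.txt", "b.txt"))   # a bytes path

# Mixing the two types in one call is rejected rather than guessed at.
try:
    os.path.join("/tmp", b"b.txt")
except TypeError as exc:
    print("mixed types rejected:", exc)
```

The point of the design is that the function never changes the type the caller chose, so a caller who asked for bytes never gets silently decoded.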
So I can only imagine that if you're proposing (1), you're doing so without the intention of suggesting that Python be converted to use it.
Maybe updating the stdlib to be correct in the face of such changes is hard, but it doesn't seem intractable. It looks like there are only about 100 calls in the stdlib to getcwd and abspath combined, and I suspect many of them are for purely aesthetic purposes and could just be eliminated, and many of them are redefinitions of the functions and don't need any changes. All the other path manipulation functions would continue to work as-is, although some of them might skip undecodable files.
And so, of course, that doesn't really fix things (such as getcwd failing if your cwd is a path that is undecodable in the current locale, or, well, currently, Python refusing to even start).
The proposal as I understand it so far doesn't address this specifically, so I'll try to. os.getcwd, os.path.abspath, and os.path.realpath (when called with unicode) will probably need to do something gross if they're called on a non-decodable directory. One thing that comes to mind is to create a temporary symbolic link and return u'/tmp/python-$YOURUID-undecodable/$GUID/something'. I hope someone else has a better idea, especially since that sort of defeats the purpose of realpath. On the other hand, even this strawman answer is correct for pretty much any sane purpose, and if you _really_ care, you need to learn that you have to use and ask for bytes, on POSIX, to deal with such corner cases.
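The symlink strawman could be sketched roughly like this. It is POSIX-only, `decodable_alias` is a hypothetical name, and the temporary directory layout is simplified compared to the path suggested above (it uses `tempfile.mkdtemp` and a random hex name rather than a per-UID directory). Nothing here cleans up the link afterwards.

```python
import os
import tempfile
import uuid


def decodable_alias(raw_path):
    """Return a text path that reaches `raw_path` through a symlink."""
    # A fresh private directory whose name is plain ASCII.
    alias_dir = tempfile.mkdtemp(prefix="python-undecodable-")
    # A random ASCII name, so the alias itself always decodes.
    alias = os.path.join(alias_dir, uuid.uuid4().hex)
    # The link target stays in bytes and is never decoded.
    os.symlink(os.fsencode(raw_path), os.fsencode(alias))
    return alias
```

A unicode-only caller can open files under the returned path, but, as noted, the result is no longer the "real" path, so this is only good enough for callers who do not care about that.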
If you're proposing (2), (...)
Luckily I'm not.
The proposal of using U+0000 seems like it would have been almost the same from such a wrapper's perspective, except (A) people using the filesystem APIs without the benefit of such a wrapper would have been even more screwed
I'm not sure what your "more screwed" is comparing against: current py3k behavior (i.e., decoding to Unicode in the locale's specified encoding)? I don't see how you can really be more screwed than that: not only can't you send your filename to display in a Gtk+ button, you can't access it at all, even staying within Python.
You're screwed if you're trying to access files in a portable way without worrying at all about encodings. There are files you won't be able to access, and there are conditions you won't be able to deal with. Sorry, but POSIX sucks and that's life. You're _more_ screwed if you're trying to access those files in a portable way without worrying about encodings, and the API you're using is giving you back invalid, magic path names with NULLs, rather than being slightly lossy and dropping filenames you (obviously, by virtue of the way you requested those filenames) won't be able to deal with. So I was talking here about the default behavior in the case of a naive program that wants to pretend all paths are unicode.
and (B) there are a few nasty corner-cases when dealing with surrogate (i.e. invalid, in UTF-8) code points which I'm not quite sure what it would have done with.
The lone-surrogate-pair proposal was a totally different proposal than the U+0000 one.
I wasn't referring to the lone-surrogate-pair encoding trick; I was referring to the fact that some people are going to want to treat surrogate pairs as encoding errors (i.e. include the NULL byte) and some will want to treat them as valid. If you want them to be valid you have to normalize away the surrogates in order to talk to other software, but you can't do that because then you'll get different bytes when you re-encode them. There's probably a way around that but it would be subtle and controversial no matter how you did it.
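A small illustration of the re-encoding problem, using a filename fragment whose bytes spell out a surrogate pair in UTF-8 style. The byte string is an example I picked, and the `surrogatepass` error handler is used here only to model the "treat them as valid" policy; it is not something the proposal specifies.

```python
# UTF-8-style bytes for the surrogate pair D83D DE00 (three bytes each).
raw = b"\xed\xa0\xbd\xed\xb8\x80"

# Policy 1: surrogates are encoding errors, so a strict decode refuses them.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# Policy 2: surrogates are valid, so let them through as two code points.
as_surrogates = raw.decode("utf-8", "surrogatepass")

# Normalize the pair into the single code point it stands for (U+1F600).
normalized = as_surrogates.encode("utf-16", "surrogatepass").decode("utf-16")

# Re-encoding the normalized string yields four bytes, not the original six.
reencoded = normalized.encode("utf-8")

print(strict_ok)          # False
print(len(as_surrogates)) # 2
print(reencoded)          # b'\xf0\x9f\x98\x80'
print(reencoded == raw)   # False
```

Once the name has been normalized, the bytes needed to open the original file are gone, which is the round-trip failure described above.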