[Python-3000] [Python-Dev] New proposition for Python3 bytes filename issue
glyph at divmod.com
Wed Oct 1 07:19:47 CEST 2008
On 03:32 am, foom at fuhm.net wrote:
>On Sep 30, 2008, at 10:06 PM, glyph at divmod.com wrote:
>Can you clarify what proposal you are supporting for Python:
Sure. Neither of your descriptions is terribly accurate, but I'll try
to explain.
>1) Two sets of APIs, one returning unicode strings, and one returning
>bytestrings. (subpoints: what does the unicode-returning API do when
>it cannot decode the bytestring into unicode? raise exception, pretend
>argument/envvar/file didn't exist/?)
The only API discussed so far which would actually provide two variants
is 'getcwd', which would have a 'getcwdb' that gives back bytes instead.
Pretty much every other API takes some kind of input. listdir(bytes)
would give back bytes, while listdir(text) would give back text.
listdir(text) would skip undecodable filenames.
Similarly for all the other APIs in os and os.path that take pathnames
for input.
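
A minimal sketch of the "type of the input picks the type of the output" rule described above. The names decode_or_skip and listdir_like, and the raw_listing parameter, are hypothetical and exist only for illustration: a real implementation would get the raw names from the OS and the encoding from the locale, not from arguments.

```python
def decode_or_skip(raw_names, encoding="utf-8"):
    """Decode each bytes name; silently drop the ones that cannot be decoded."""
    decoded = []
    for raw in raw_names:
        try:
            decoded.append(raw.decode(encoding))
        except UnicodeDecodeError:
            # Undecodable name: the text-returning variant skips it.
            continue
    return decoded


def listdir_like(path, raw_listing, encoding="utf-8"):
    """Hypothetical stand-in for listdir: bytes in, bytes out; text in, text out.

    raw_listing plays the role of the names the OS would hand back as bytes.
    """
    if isinstance(path, bytes):
        # Bytes caller: nothing is decoded, so nothing is lost.
        return list(raw_listing)
    # Text caller: lossy, but every returned name is displayable.
    return decode_or_skip(raw_listing, encoding)


names_on_disk = [b"ok.txt", b"bad\xff.txt"]
bytes_view = listdir_like(b".", names_on_disk)
text_view = listdir_like(".", names_on_disk)
```

With this sketch the bytes caller sees both names, while the text caller sees only the one that decodes.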
>2) All APIs return bytestrings only. Converting to unicode is
>considered lossy, and would have to be done by applications for
>display purposes only.
This is a bad way to do things, because on Windows, filenames *really
are* unicode. Converting to bytes is what's lossy. (See previous
discussion of active codepages and CreateFileA/CreateFileW.)
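
To make the "converting to bytes is what's lossy" point concrete, here is a small sketch. It does not call any Windows API; cp1252 and the "replace" error handler are stand-ins I picked for the example, and the actual substitution Windows performs depends on the active codepage.

```python
def to_codepage(name, codepage="cp1252"):
    """Encode a text filename into a single-byte codepage.

    Characters the codepage cannot represent are replaced with b"?",
    which is what makes the conversion lossy.
    """
    return name.encode(codepage, errors="replace")


# Three CJK characters followed by ".txt"; none of them exist in cp1252.
original = "\u65e5\u672c\u8a9e.txt"
as_bytes = to_codepage(original)
round_tripped = as_bytes.decode("cp1252")

# The round trip does not give back the original name.
name_survived = (round_tripped == original)
```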
>I really don't understand the reasoning for (1).
The reasoning is that a lot of software doesn't care if it's wrong for
edge cases, and it's really hard to come up with something that's correct
with respect to all of those edge cases (absurdly difficult, if you need
to stay in the straitjacket of string / bytes types, as well as
provide a useful library interface, which is why we're having this
discussion). But it should be _possible_ to write software that's
correct in the face of those edge cases.
And - let's not forget this - the worlds of POSIX and Windows really are
different and really do require subtly different inputs. Python can try
to paper over this like Java does and make it impossible to write
certain classes of application, or it can just provide an ugly, slightly
inconsistent API that exposes the ugly, slightly inconsistent reality.
Modulo the issues you've raised which I don't think the proposal totally
covers yet (abspath with a non-decodable cwd), I think it strikes a nice
balance: allow people to live in the delusion of unicode-on-POSIX and
have software that mostly works, most of the time, or allow them to face
the unpleasantness and spend the effort to get something really solid.
I think the _right_ answer to all of this is to (A) make FilePath work
completely correctly for every totally insane edge case ever, and (B)
include it in the stdlib. One day I think we'll do that. But nobody
has the time or energy to do even the first part of that *right now*,
before 3.0 is released, so I'm just looking for something which it will
be possible to build FilePath, or something like it, on top of, without
too badly breaking the applications of other people who rely on the os
module directly.
>It seems to me that most software (probably including all of the
>Python stdlib) would continue to use the unicode string API.
That's true. And that software wouldn't handle these edge cases
completely correctly. As Guido put it, "it's a quality of
implementation issue".
>Switching all of the Python stdlib to use the bytestring APIs instead
>would certainly be a large undertaking, and would have all sorts of
>ripple-on API changes (e.g. __file__).
I am not quite sure what to do about __file__. My preference would
probably be to use a unicode filename for consistency so it can always be
displayed, but provide a second attribute (__open_file__?) that would be
sometimes unicode, sometimes bytes, which would be guaranteed to work
with open(). I suspect that most software which interacts with __file__
on a deep level would be of the variety which would deal with the edge
cases.
But where the Python stdlib wants a pathname it should be accepting
either bytes or unicode, as all of the os.path functions want. This
does kind of suck, but the alternatives are to encode crazy extra
information in unicode path names that cannot be exchanged with other
programs (or with users: NULL is potentially the worst bogus character
from a UI perspective), or revert to bytes for everything (which is a
non-solution, cf. Windows above).
>So I can only imagine that if you're proposing (1), you're doing so
>without the intention of suggesting that Python be converted to use
>it.
Maybe updating the stdlib to be correct in the face of such changes is
hard, but it doesn't seem intractable. It looks like there are only
about 100 calls in the stdlib to getcwd and abspath combined, and I
suspect many of them are for purely aesthetic purposes and could just be
eliminated, and many of them are redefinitions of the functions and
don't need any changes.
All the other path manipulation functions would continue to work as-is,
although some of them might skip undecodable files.
>And so, of course, that doesn't really fix things (such as getcwd
>failing if your cwd is a path that is undecodeable in the current
>locale, or well, currently, python refusing to even start).
The proposal as I understand it so far doesn't address this
specifically, so I'll try to. os.getcwd, os.path.abspath, and
os.path.realpath (when called with unicode) will probably need to do
something gross if they're called on a non-decodable directory. One
thing that comes to mind is to create a temporary symbolic link and
return u'/tmp/python-$YOURUID-undecodable/$GUID/something'. I hope
someone else has a better idea, especially since that sort of defeats
the purpose of realpath.
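
A rough sketch of the symlink strawman, for POSIX only. The function name text_alias_for and its base parameter are hypothetical, and I use tempfile.mkdtemp rather than the exact /tmp layout mentioned above; cleanup, permissions and races are all ignored here.

```python
import os
import tempfile
import uuid


def text_alias_for(raw_dir, base=None):
    """Return a text path that is a symlink pointing at the bytes path raw_dir.

    raw_dir may contain bytes that do not decode in the current locale;
    the returned alias is ordinary text and can always be displayed.
    """
    if base is None:
        # Assumption: any private temporary directory will do for the sketch.
        base = tempfile.mkdtemp(prefix="python-undecodable-")
    holder = os.path.join(base, uuid.uuid4().hex)
    os.mkdir(holder)
    alias = os.path.join(holder, "something")
    # Pass both arguments as bytes so the target is never decoded.
    os.symlink(raw_dir, os.fsencode(alias))
    return alias
```

Opening files through the alias works, but asking for the real path of the alias leads straight back to the undecodable name, which is the weakness noted above.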
On the other hand, even this strawman answer is correct for pretty much
any sane purpose, and if you _really_ care, you need to learn that you
have to use and ask for bytes, on POSIX, to deal with such corner cases.
>If you're proposing (2), (...)
Luckily I'm not.
>>The proposal of using U+0000 seems like it would have been almost the
>>same from such a wrapper's perspective, except (A) people using the
>>filesystem APIs without the benefit of such a wrapper would have been
>>even more screwed
>
>I'm not sure what your "more screwed" is comparing against: current
>py3k behavior? (aka: decoding to Unicode in locale's specified
>encoding)? I don't see how you can really be more screwed than that:
>not only can't you send your filename to display in a Gtk+ button, you
>can't access it at all, even staying within python.
You're screwed if you're trying to access files in a portable way
without worrying at all about encodings. There are files you won't be
able to access, there are conditions you won't be able to deal with.
Sorry, but POSIX sucks and that's life.
You're _more_ screwed if you're trying to access those files in a
portable way without worrying about encodings, and the API you're using
is giving you back invalid, magic path names with NULLs, rather than
being slightly lossy and dropping filenames you (obviously, by virtue of
the way you requested those filenames) won't be able to deal with.
So I was talking here about the default behavior in the case of a naive
program that wants to pretend all paths are unicode.
>>and (B) there are a few nasty corner-cases when dealing with
>>surrogate (i.e. invalid, in UTF-8) code points which I'm not quite
>>sure what it would have done with.
>
>The lone-surrogate-pair proposal was a totally different proposal than
>the U+0000 one.
I wasn't referring to the lone-surrogate-pair encoding trick, I was
referring to the fact that some people are going to want to treat
surrogate pairs as encoding errors (i.e. include the NULL byte) and some
will want to treat them as valid. If you want them to be valid you have
to normalize away the surrogates in order to talk to other software, but
you can't do that because then you'll get different bytes when you re-
encode them.
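
The re-encoding problem can be shown with a few lines. This is only an illustration: the "surrogatepass" error handler is a facility of later Python versions that I'm using to stand in for "treat surrogates as valid", and the byte string is an example I constructed, not one from the discussion.

```python
# One character (U+1F600) written the "wrong" way: as two surrogate
# code points, each encoded as three UTF-8-style bytes.
raw = b"\xed\xa0\xbd\xed\xb8\x80"

# Policy 1: surrogates are encoding errors, so strict decoding rejects them.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# Policy 2: surrogates are valid, so decoding keeps them as two code points.
as_text = raw.decode("utf-8", errors="surrogatepass")

# To talk to other software, normalize the pair into the real code point.
normalized = as_text.encode("utf-16", errors="surrogatepass").decode("utf-16")

# Re-encoding the normalized text yields different bytes than we started
# with, so the original filename can no longer be found on disk.
reencoded = normalized.encode("utf-8")
same_bytes = (reencoded == raw)
```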
There's probably a way around that but it would be subtle and
controversial no matter how you did it.