[Python-Dev] [Python-3000] New proposition for Python3 bytes filename issue

Wed Oct 1 05:32:04 CEST 2008

On Sep 30, 2008, at 10:06 PM, glyph at divmod.com wrote:
> However, Martin, I can promise you that I will _never_ ask for any  
> convenience functions related to bytes as a result of this  
> decision.  I want bytes to come back from filesystem APIs because I  
> intend to have a wrapper layer which knows two things about the  
> file: the bytes (which are needed to talk to POSIX filesystem APIs)  
> and the characters (which are computed from those bytes, can be  
> safely renormalized, displayed to users, etc).  On Windows this  
> filesystem wrapper will necessarily behave differently, and will not  
> use bytes for anything.  Any formatting beyond joining path segments  
> together and possibly splitting extensions off will be done on  
> character strings, not byte strings.
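
To check that I'm reading the wrapper idea correctly, here is a minimal sketch of what I think you mean. The class name FilePath, its methods, and the UTF-8 default are my own hypothetical choices, not an existing or proposed API:

```python
class FilePath:
    """Hypothetical wrapper: keeps the raw bytes and derives display text."""

    def __init__(self, raw: bytes, encoding: str = "utf-8"):
        self.raw = raw            # exact bytes, used to talk to POSIX APIs
        self.encoding = encoding  # assumed encoding, used for display only

    @property
    def text(self) -> str:
        # Deliberately lossy: undecodable bytes are shown as U+FFFD.
        return self.raw.decode(self.encoding, errors="replace")

    def child(self, segment: bytes) -> "FilePath":
        # Joining path segments is done on bytes, so nothing is lost.
        return FilePath(self.raw + b"/" + segment, self.encoding)


path = FilePath(b"/tmp").child(b"caf\xe9.txt")  # Latin-1 name, invalid UTF-8
print(path.raw)   # what gets passed back to the filesystem
print(path.text)  # what gets shown to the user
```

The point of keeping both is that the displayed text cannot be encoded back into the original bytes, so only the raw form is safe to hand back to the filesystem.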

Can you clarify what proposal you are supporting for Python:

1) Two sets of APIs, one returning unicode strings, and one returning  
bytestrings. (Subpoint: what does the unicode-returning API do when  
it cannot decode the bytestring into unicode? Raise an exception, or  
pretend the argument/envvar/file didn't exist?)

or

2) All APIs return bytestrings only. Converting to unicode is  
considered lossy, and would have to be done by applications for  
display purposes only.
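
To make the difference concrete, here is a rough sketch of the two shapes. The function names and the on_error parameter are hypothetical, made up for illustration, and a plain list of byte names stands in for whatever the OS returns:

```python
def names_as_unicode(raw_names, encoding="utf-8", on_error="raise"):
    """Option (1), unicode flavor: decode every name, or fail somehow."""
    result = []
    for raw in raw_names:
        try:
            result.append(raw.decode(encoding))
        except UnicodeDecodeError:
            if on_error == "raise":
                raise
            # on_error == "skip": pretend the entry does not exist
    return result


def names_as_bytes(raw_names):
    """Option (2), or the bytes flavor of (1): hand back what the OS gave us."""
    return list(raw_names)


raw = [b"ok.txt", b"caf\xe9.txt"]  # second name is not valid UTF-8
print(names_as_unicode(raw, on_error="skip"))
print(names_as_bytes(raw))
```

Either choice for on_error loses something in the unicode flavor: the caller gets an exception for the whole call, or silently never sees the file.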

I really don't understand the reasoning for (1). It seems to me that  
most software (probably including all of the Python stdlib) would  
continue to use the unicode string API. Switching all of the Python  
stdlib to use the bytestring APIs instead would certainly be a large  
undertaking, and would have all sorts of knock-on API changes (e.g.  
__file__). So I can only imagine that if you're proposing (1), you're  
doing so without the intention of suggesting that Python be converted  
to use it.

And so, of course, that doesn't really fix things (such as getcwd  
failing if your cwd is a path that is undecodable in the current  
locale, or, currently, Python refusing to even start).

If you're proposing (2), it's at least as large an undertaking as (1)  
+ converting Python to use the optional bytestring APIs. But at least  
it avoids exposing an API that people ought not use, and does make it  
obvious what still needs to be fixed: the unfixed code simply won't  
run at all.

> The proposal of using U+0000 seems like it would have been almost  
> the same from such a wrapper's perspective, except (A) people using  
> the filesystem APIs without the benefit of such a wrapper would have  
> been even more screwed

I'm not sure what your "more screwed" is comparing against: current  
py3k behavior (i.e., decoding to Unicode in the locale's specified  
encoding)? I don't see how you can really be more screwed than that:  
not only can't you send your filename to display in a Gtk+ button, you  
can't access it at all, even staying within Python.

> and (B) there are a few nasty corner-cases when dealing with  
> surrogate (i.e. invalid, in UTF-8) code points which I'm not quite  
> sure what it would have done with.

The lone-surrogate proposal was a totally different proposal from  
the U+0000 one.

James