[Python-Dev] [Python-3000] New proposition for Python3 bytes filename issue

Wed Oct 1 04:06:25 CEST 2008

On 30 Sep, 09:22 pm, guido at python.org wrote:
>On Tue, Sep 30, 2008 at 1:04 PM, "Martin v. Löwis" <martin at v.loewis.de> 
>wrote:
>>Guido van Rossum wrote:
>>>On Mon, Sep 29, 2008 at 11:00 PM, "Martin v. Löwis" 
>>><martin at v.loewis.de> wrote:

>>>Martin, I don't understand why you are in favor of storing raw bytes
>>>encoded as Latin-1 in Unicode string objects, which clearly gives rise
>>>to mojibake.

This is my word of the day, by the way.  Reading this whole thread was 
_totally_ worth it to learn about "mojibake".  Obviously I'm familiar 
with the phenomenon but somehow I'd never heard this awesome term 
before.

>I am also encouraged by Glyph's support for (a). He has a lot of
>practical experience.

Thanks for the vote of confidence.  I hope for all our sakes that you're 
not over-valuing that experience ;-).

For what it's worth, I can see MvL's point in that I think there is some 
danger in generating confusion by adding _too many_ string-like 
functions to the bytes type.  I don't want my suggestion to contribute 
to the confusion between bytes and text.

However, Martin, I can promise you that I will _never_ ask for any 
convenience functions related to bytes as a result of this decision.  I 
want bytes to come back from filesystem APIs because I intend to have a 
wrapper layer which knows two things about the file: the bytes (which 
are needed to talk to POSIX filesystem APIs) and the characters (which 
are computed from those bytes, can be safely renormalized, displayed to 
users, etc).  On Windows this filesystem wrapper will necessarily behave 
differently, and will not use bytes for anything.  Any formatting beyond 
joining path segments together and possibly splitting extensions off 
will be done on character strings, not byte strings.
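To make the shape of such a wrapper concrete, here is a minimal sketch in Python 3.  It is only an illustration under stated assumptions: the class name FilePathSketch, its methods, the UTF-8 default and the choice of NFC normalization are all hypothetical and are not taken from any existing library.

```python
import posixpath
import unicodedata


class FilePathSketch:
    """Hypothetical wrapper that remembers a path's bytes and derives its text."""

    def __init__(self, raw, encoding="utf-8"):
        if not isinstance(raw, bytes):
            raise TypeError("raw must be bytes")
        self.raw = raw            # exactly what the POSIX APIs gave us
        self.encoding = encoding  # assumption: caller knows the likely encoding

    @property
    def text(self):
        # Display form only: undecodable bytes become U+FFFD, and the result
        # is normalized, so it must never be fed back to the filesystem.
        decoded = self.raw.decode(self.encoding, "replace")
        return unicodedata.normalize("NFC", decoded)

    def child(self, segment):
        # Joining segments is one of the few operations done on bytes.
        if isinstance(segment, str):
            segment = segment.encode(self.encoding)
        return FilePathSketch(posixpath.join(self.raw, segment), self.encoding)

    def split_extension(self):
        # Splitting off an extension is the other bytes-level operation.
        root, ext = posixpath.splitext(self.raw)
        return FilePathSketch(root, self.encoding), ext


if __name__ == "__main__":
    # b"\xe9" is Latin-1 for "e acute" and is not valid UTF-8 on its own.
    path = FilePathSketch(b"/tmp/caf\xe9").child("notes.txt")
    print(repr(path.raw))   # bytes, safe to hand back to the OS
    print(repr(path.text))  # characters, safe to show to a user
```

The point of the split is that the bytes always round-trip to the filesystem unchanged, while the text is free to be lossy because nothing ever converts it back.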

The proposal of using U+0000 seems like it would have been almost the 
same from such a wrapper's perspective, except (A) people using the 
filesystem APIs without the benefit of such a wrapper would have been 
even more screwed, and (B) there are a few nasty corner-cases when 
dealing with surrogate (i.e. invalid, in UTF-8) code points, and I'm 
not quite sure how the proposal would have handled them.
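As a small illustration of why surrogates are awkward, the sketch below (Python 3, with a hypothetical helper name) shows that a lone surrogate code point can live in a str but cannot be encoded as UTF-8, and that the corresponding byte sequence is rejected on decoding.  It says nothing about how the U+0000 proposal itself would have behaved.

```python
def encodes_as_utf8(text):
    """Return True if text can be encoded as strict UTF-8 (hypothetical helper)."""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True


def decodes_as_utf8(data):
    """Return True if data is valid strict UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    lone_surrogate = "\ud800"            # a str may hold this code point
    print(encodes_as_utf8("hello"))
    print(encodes_as_utf8(lone_surrogate))
    # The three bytes a naive encoder would emit for U+D800:
    print(decodes_as_utf8(b"\xed\xa0\x80"))
```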

Guido already mentioned "libraries" as a hypothetical issue, but here's 
a real-world problem that results from putting NUL characters into 
filenames.  Consider this program:

    import gtk
    w = gtk.Window()
    b = gtk.Button(u"\u0000/hello/world")
    w.add(b)
    w.show_all()
    gtk.main()

which emits this message:

    TypeError: GtkButton.__init__() argument 1 must be string without null bytes or None, not unicode
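The same class of failure can be seen without GTK.  As a rough sketch (Python 3; the helper name accepts_path is made up for this example), the standard os functions refuse a path containing U+0000 before the request ever reaches the operating system:

```python
import os


def accepts_path(path):
    """Return True if os.stat() is willing to pass path to the OS.

    Hypothetical helper: a missing file still counts as accepted, because
    the path got as far as the operating system.
    """
    try:
        os.stat(path)
    except (ValueError, TypeError):
        # Raised by Python itself for an embedded NUL character.
        return False
    except OSError:
        # The OS saw the path and complained (e.g. no such file).
        return True
    return True


if __name__ == "__main__":
    print(accepts_path("."))
    print(accepts_path("\u0000/hello/world"))
```

Any string that smuggles bytes through U+0000 would therefore need to be unpacked by every caller before it touched an API like this one.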

SQLite has a similar problem with NUL characters, and I'm definitely 
sticking paths in there, too.

Eventually I'd like to propose such a path type for inclusion in the 
stdlib, but that will have to wait for issues like 
<http://twistedmatrix.com/trac/ticket/2366> to be resolved.