[Python-3000] [Python-Dev] New proposition for Python3 bytes filename issue
glyph at divmod.com
Wed Oct 1 04:06:25 CEST 2008
On 30 Sep, 09:22 pm, guido at python.org wrote:
>On Tue, Sep 30, 2008 at 1:04 PM, "Martin v. Löwis" <martin at v.loewis.de>
>wrote:
>>Guido van Rossum wrote:
>>>On Mon, Sep 29, 2008 at 11:00 PM, "Martin v. Löwis"
>>><martin at v.loewis.de> wrote:
>>>Martin, I don't understand why you are in favor of storing raw bytes
>>>encoded as Latin-1 in Unicode string objects, which clearly gives rise
>>>to mojibake.
This is my word of the day, by the way. Reading this whole thread was
_totally_ worth it to learn about "mojibake". Obviously I'm familiar
with the phenomenon but somehow I'd never heard this awesome term
before.
>I am also encouraged by Glyph's support for (a). He has a lot of
>practical experience.
Thanks for the vote of confidence. I hope for all our sakes that you're
not over-valuing that experience ;-).
For what it's worth, I can see MvL's point in that I think there is some
danger in generating confusion by adding _too many_ string-like
functions to the bytes type. I don't want my suggestion to contribute
to the confusion between bytes and text.
However, Martin, I can promise you that I will _never_ ask for any
convenience functions related to bytes as a result of this decision. I
want bytes to come back from filesystem APIs because I intend to have a
wrapper layer which knows two things about the file: the bytes (which
are needed to talk to POSIX filesystem APIs) and the characters (which
are computed from those bytes, can be safely renormalized, displayed to
users, etc). On Windows this filesystem wrapper will necessarily behave
differently, and will not use bytes for anything. Any formatting beyond
joining path segments together and possibly splitting extensions off
will be done on character strings, not byte strings.
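A minimal sketch of the kind of wrapper layer described above, written for Python 3. Everything in it is hypothetical and for illustration only: the class name FilePath, its attributes raw and text, and the methods child and children are not from any existing library, and the UTF-8 default and NFC normalization are assumptions rather than part of the proposal. It keeps the bytes for talking to the filesystem and computes the characters from them for display.

```python
import os
import unicodedata


class FilePath:
    """Hypothetical wrapper: raw bytes for the OS, derived text for people."""

    def __init__(self, raw, encoding="utf-8"):
        if not isinstance(raw, bytes):
            raise TypeError("raw path must be bytes")
        self.raw = raw            # exactly what the filesystem API gave us
        self.encoding = encoding  # assumption: caller knows or guesses this

    @property
    def text(self):
        # Characters computed from the bytes. Undecodable bytes become
        # U+FFFD, so this form is for display only and is never passed
        # back to the filesystem.
        decoded = self.raw.decode(self.encoding, errors="replace")
        return unicodedata.normalize("NFC", decoded)

    def child(self, segment):
        # Joining path segments is the only manipulation done on bytes.
        if isinstance(segment, str):
            segment = segment.encode(self.encoding)
        return FilePath(os.path.join(self.raw, segment), self.encoding)

    def children(self):
        # os.listdir() returns bytes names when given a bytes path.
        return [self.child(name) for name in os.listdir(self.raw)]

    def open(self, mode="rb"):
        # Always talk to the OS with the original bytes.
        return open(self.raw, mode)
```

With this shape, a name that is not valid in the assumed encoding still round-trips to the filesystem through raw, while text stays safe to normalize and show to users.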
The proposal to use U+0000 seems like it would have been almost the
same from such a wrapper's perspective, except that (A) people using the
filesystem APIs without the benefit of such a wrapper would have been
even more screwed, and (B) there are a few nasty corner cases involving
surrogate code points (i.e. invalid in UTF-8), and I'm not quite sure
how the proposal would have handled them.
Guido already mentioned "libraries" as a hypothetical issue, but here's
a real-world problem that results from putting NUL characters into
filenames. Consider this program:
    import gtk
    w = gtk.Window()
    b = gtk.Button(u"\u0000/hello/world")
    w.add(b)
    w.show_all()
    gtk.main()
which emits this message:
    TypeError: OGtkButton.__init__() argument 1 must be string without null bytes or None, not unicode
SQLite has a similar problem with NUL characters, and I'm definitely
sticking paths in there, too.
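The same kind of rejection can be seen without GTK or SQLite installed. The following is only an illustrative sketch, assuming CPython 3 (where a string containing U+0000 is refused before it ever reaches the operating system); the helper name rejects_nul is made up for this example.

```python
def rejects_nul(path):
    """Return True if open() refuses the path before touching the OS."""
    try:
        open(path).close()
    except (ValueError, TypeError):
        # CPython refuses embedded NULs up front; the exception type has
        # varied between versions, so both are accepted here.
        return True
    except OSError:
        # The path reached the OS and failed for an ordinary reason.
        return False
    return False


print(rejects_nul(u"\u0000/hello/world"))
```

Any code that receives such a string from a filesystem API and passes it on unmodified would hit this, which is the practical cost of encoding undecodable names with U+0000.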
Eventually I'd like to propose such a path type for inclusion in the
stdlib, but that will have to wait for issues like
<http://twistedmatrix.com/trac/ticket/2366> to be resolved.