On 30 Sep, 09:22 pm, guido@python.org wrote:
On Tue, Sep 30, 2008 at 1:04 PM, "Martin v. Löwis" <martin@v.loewis.de> wrote:
Guido van Rossum wrote:
On Mon, Sep 29, 2008 at 11:00 PM, "Martin v. Löwis" <martin@v.loewis.de> wrote:
Martin, I don't understand why you are in favor of storing raw bytes encoded as Latin-1 in Unicode string objects, which clearly gives rise to mojibake.
This is my word of the day, by the way. Reading this whole thread was _totally_ worth it to learn about "mojibake". Obviously I'm familiar with the phenomenon but somehow I'd never heard this awesome term before.
I am also encouraged by Glyph's support for (a). He has a lot of practical experience.
Thanks for the vote of confidence. I hope for all our sakes that you're not over-valuing that experience ;-).

For what it's worth, I can see MvL's point in that I think there is some danger in generating confusion by adding _too many_ string-like functions to the bytes type. I don't want my suggestion to contribute to the confusion between bytes and text.

However, Martin, I can promise you that I will _never_ ask for any convenience functions related to bytes as a result of this decision. I want bytes to come back from filesystem APIs because I intend to have a wrapper layer which knows two things about the file: the bytes (which are needed to talk to POSIX filesystem APIs) and the characters (which are computed from those bytes, can be safely renormalized, displayed to users, etc). On Windows this filesystem wrapper will necessarily behave differently, and will not use bytes for anything. Any formatting beyond joining path segments together and possibly splitting extensions off will be done on character strings, not byte strings.

The proposal of using U+0000 seems like it would have been almost the same from such a wrapper's perspective, except (A) people using the filesystem APIs without the benefit of such a wrapper would have been even more screwed, and (B) there are a few nasty corner-cases when dealing with surrogate (i.e. invalid, in UTF-8) code points which I'm not quite sure what it would have done with.

Guido already mentioned "libraries" as a hypothetical issue, but here's a real-world problem that results from putting NULLs into filenames. Consider this program:

    import gtk
    w = gtk.Window()
    b = gtk.Button(u"\u0000/hello/world")
    w.add(b)
    w.show_all()
    gtk.main()

which emits this message:

    TypeError: OGtkButton.__init__() argument 1 must be string without null bytes or None, not unicode

SQLite has a similar problem with NULLs, and I'm definitely sticking paths in there, too.
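
For concreteness, here is a rough sketch of the kind of wrapper layer described above, covering the POSIX case only. The class name `PathName`, its attributes, and the UTF-8 default are made up for illustration and are not taken from any existing library; it is written for Python 3, where `bytes` and `str` are distinct types, and it leaves out the Windows behaviour entirely.

```python
import posixpath


class PathName:
    """Hypothetical wrapper: keeps the raw bytes, derives characters from them."""

    def __init__(self, raw: bytes, encoding: str = "utf-8"):
        # Reject NUL up front, since C-level libraries cannot carry it.
        if b"\x00" in raw:
            raise ValueError("NUL byte in path")
        self.raw = raw            # exact bytes, for talking to POSIX APIs
        self.encoding = encoding  # assumed encoding of the file names

    @property
    def text(self) -> str:
        # Display-only form: undecodable bytes become U+FFFD, so this
        # string must never be used to reopen the file.
        return self.raw.decode(self.encoding, errors="replace")

    def child(self, segment: bytes) -> "PathName":
        # Joining segments is the only formatting done on the byte form.
        return PathName(posixpath.join(self.raw, segment), self.encoding)


if __name__ == "__main__":
    p = PathName(b"/tmp").child(b"caf\xe9")  # Latin-1 name on a UTF-8 system
    print(repr(p.raw))   # bytes survive untouched
    print(repr(p.text))  # characters are safe to show to a user
```

The point of the split is that the lossy step (decoding) only ever happens on the way to the user, so a badly encoded name can still be opened, renamed, or deleted through `raw`.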
Eventually I'd like to propose such a path type for inclusion in the stdlib, but that will have to wait for issues like <http://twistedmatrix.com/trac/ticket/2366> to be resolved.