Re: [Python-Dev] [Python-3000] New proposition for Python3 bytes filename issue
On Mon, Sep 29, 2008 at 11:22 PM, Georg Brandl <g.brandl@gmx.net> wrote:
No, that was not what I meant (although it is another possibility). As I wrote, Martin's proposal that I support here is using the modified UTF-8 codec that successfully roundtrips otherwise invalid UTF-8 data.
I thought that the "successful roundtripping" pretty much stopped as soon as the unicode data is exported to somewhere else -- doesn't it contain invalid surrogate sequences? In general, I'm very reluctant to use utf-8b given that it doesn't seem to be well documented as a standard anywhere. Providing some minimal APIs that can process raw-bytes filenames still makes more sense -- it is mostly analogous to our treatment of text files, where the underlying binary data is also accessible.
You seem to forget that (disregarding OSX here, since it already enforces UTF-8) the majority of file names on Posix systems will be encoded correctly.
Apparently under certain circumstances (external FS mounted) OSX can also have non-UTF-8 filenames. [...]
With the filenames decoded by UTF-8, your files named têste, ô, dossié will be displayed and handled correctly. The others are *invalid* in the filesystem encoding UTF-8 and therefore would be represented by something like
u'dir\uXXffname' where XX is some private use Unicode namespace. It won't look pretty when printed, but then, what do other applications do? They e.g. display a question mark as you show above, which is not better in terms of readability.
But it will work when given to a filename-handling function. Valid filenames can be compared to Unicode strings.
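For illustration only: later Python 3 releases ship an error handler named surrogateescape that roundtrips undecodable bytes in much the way described here. It maps each bad byte to a lone surrogate rather than to a private-use code point, so it is not the exact codec under discussion. A minimal sketch with a made-up filename:

```python
# Hypothetical filename containing one byte (0xff) that is invalid in UTF-8.
raw_name = b"dir\xffname"

# Decode with surrogateescape: the bad byte becomes the lone surrogate U+DCFF.
text_name = raw_name.decode("utf-8", "surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
round_tripped = text_name.encode("utf-8", "surrogateescape")

# A strict encode fails, because the decoded string is not valid Unicode text
# for consumers that do not know about the escaping convention.
try:
    text_name.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
```

The roundtrip only holds as long as both directions use the same handler, which is the concern about exporting such strings elsewhere.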
A real-world example: OpenOffice can't open files with invalid bytes in their name. They are displayed in the "Open file" dialog, but trying to open them fails. This regularly drives me crazy. Let's not make Python fail this way too, or, even worse, not even display those filenames.
How can it *regularly* drive you crazy when "the majority of file names [...] encoded correctly" (as you assert above)?
-- --Guido van Rossum (home page: http://www.python.org/~guido/)
Guido van Rossum wrote:
How can it *regularly* drive you crazy when "the majority of file names [...] encoded correctly" (as you assert above)?
Because Office files are a) often named with long, seemingly descriptive filenames, which invariably means umlauts in German, and b) often sent around between systems, creating encoding problems.
Having seen how much controversy returning an invalid Unicode string sparks, and given that it really isn't obvious to the newbie either, I think I now agree that dropping filenames when calling a listdir() that returns Unicode filenames is the best solution. I'm a little uneasy with having one function for both bytes and Unicode return, because that kind of str/unicode mixing I thought we had left behind in 2.x, but of course can live with it.
Georg
-- 
Thus spake the Lord: Thou shalt indent with four spaces. No more, no less. Four shall be the number of spaces thou shalt indent, and the number of thy indenting shall be four. Eight shalt thou not indent, nor either indent thou two, excepting that thou then proceed to four. Tabs are right out.
On Tue, Sep 30, 2008 at 10:28 AM, Georg Brandl <g.brandl@gmx.net> wrote:
How can it *regularly* drive you crazy when "the majority of file names [...] encoded correctly" (as you assert above)?
Because Office files are a) often named with long, seemingly descriptive filenames, which invariably means umlauts in German, and b) often sent around between systems, creating encoding problems.
Gotcha.
Having seen how much controversy returning an invalid Unicode string sparks, and given that it really isn't obvious to the newbie either, I think I now agree that dropping filenames when calling a listdir() that returns Unicode filenames is the best solution. I'm a little uneasy with having one function for both bytes and Unicode return, because that kind of str/unicode mixing I thought we had left behind in 2.x, but of course can live with it.
Well, the *current* Py3k behavior where it may return a mix of bytes and str instances is really messy, and likely to trip up most code that doesn't expect it in a way that makes it hard to debug. However the *proposed* behavior (returns bytes if the arg was bytes, and returns str when the arg was str) is IMO sane, and no different than the polymorphism found in len() or many builtin operations. -- --Guido van Rossum (home page: http://www.python.org/~guido/)
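The proposed polymorphism can be sketched as follows. This matches how os.listdir() behaves in later Python 3 releases; the scratch directory and the file name are made up for the example.

```python
import os
import tempfile

# Create a scratch directory holding a single file with a plain ASCII name.
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "example.txt"), "w").close()

    # A str argument yields str names.
    str_names = os.listdir(tmp)

    # A bytes argument yields bytes names.
    bytes_names = os.listdir(os.fsencode(tmp))
```

The result type follows the argument type, so a caller never sees a mixed list.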
Guido van Rossum wrote:
On Tue, Sep 30, 2008 at 10:28 AM, Georg Brandl <g.brandl@gmx.net> wrote:
How can it *regularly* drive you crazy when "the majority of file names [...] encoded correctly" (as you assert above)?
Because Office files are a) often named with long, seemingly descriptive filenames, which invariably means umlauts in German, and b) often sent around between systems, creating encoding problems.
Gotcha.
Which means?
Having seen how much controversy returning an invalid Unicode string sparks, and given that it really isn't obvious to the newbie either, I think I now agree that dropping filenames when calling a listdir() that returns Unicode filenames is the best solution. I'm a little uneasy with having one function for both bytes and Unicode return, because that kind of str/unicode mixing I thought we had left behind in 2.x, but of course can live with it.
Well, the *current* Py3k behavior where it may return a mix of bytes and str instances is really messy, and likely to trip up most code that doesn't expect it in a way that makes it hard to debug. However the *proposed* behavior (returns bytes if the arg was bytes, and returns str when the arg was str) is IMO sane, and no different than the polymorphism found in len() or many builtin operations.
I agree that everything is better than the current behavior.
Georg
Guido van Rossum wrote:
However the *proposed* behavior (returns bytes if the arg was bytes, and returns str when the arg was str) is IMO sane, and no different than the polymorphism found in len() or many builtin operations.
My concern still is that it brings the bytes type into the status of another character string type, which is really bad, and will require further modifications to Python for the lifetime of 3.x.
This is because applications will then regularly use byte strings for file names on Unix, and regular strings on Windows, and then expect the program to work the same without further modifications.
The next question then will be environment variables and command line arguments, for which we then should provide two versions (e.g. sys.argv and sys.argvb; for os.environ, os.environ["PATH"] could mean something different from os.environ[b"PATH"]).
And so on (passwd/group file, Tkinter, ...)
Regards, Martin
On Tue, Sep 30, 2008 at 1:29 PM, "Martin v. Löwis" <martin@v.loewis.de> wrote:
Guido van Rossum wrote:
However the *proposed* behavior (returns bytes if the arg was bytes, and returns str when the arg was str) is IMO sane, and no different than the polymorphism found in len() or many builtin operations.
My concern still is that it brings the bytes type into the status of another character string type, which is really bad, and will require further modifications to Python for the lifetime of 3.x.
I'd like to understand why this is "really bad". I thought it was by design that the str and bytes types behave pretty similarly. You can use both as dict keys.
This is because applications will then regularly use byte strings for file names on Unix, and regular strings on Windows, and then expect the program to work the same without further modifications.
It seems that bytes arguments actually *do* work on Windows -- somehow they get decoded. (Unless Terry's report was from 2.x.)
The next question then will be environment variables and command line arguments, for which we then should provide two versions (e.g. sys.argv and sys.argvb; for os.environ, os.environ["PATH"] could mean something different from os.environ[b"PATH"]).
Actually something like that may not be a bad idea. Ian Bicking's webob supports similar double APIs for getting the request parameters out of a request object; I believe request.GET['x'] is a text object and request.GET_str['x'] is the corresponding uninterpreted bytes sequence. I would prefer to have os.environb over os.environ[b"PATH"] though.
And so on (passwd/group file, Tkinter, ...)
I assume at some point we can stop and have sufficiently low-level interfaces that everyone can agree are in bytes only. Bytes aren't going away. How does Java deal with this? Its File class doesn't seem to deal in bytes at all. What would its listFiles() method do with undecodable filenames? -- --Guido van Rossum (home page: http://www.python.org/~guido/)
My concern still is that it brings the bytes type into the status of another character string type, which is really bad, and will require further modifications to Python for the lifetime of 3.x.
I'd like to understand why this is "really bad". I thought it was by design that the str and bytes types behave pretty similarly. You can use both as dict keys.
If they have to behave pretty similarly, they have to be supported in all APIs that deal with text. For example, people will demand that printing bytes should just copy them onto the stream (rather than invoking repr()), and writing them onto a text stream should work the same way. GUI libraries should support them, as should the XML libraries, and so on. Where will you stop, and tell people that bytes are just not supposed to do this or that?
This is because applications will then regularly use byte strings for file names on Unix, and regular strings on Windows, and then expect the program to work the same without further modifications.
It seems that bytes arguments actually *do* work on Windows -- somehow they get decoded. (Unless Terry's report was from 2.x.)
To a limited degree - see my other message. Don't try to listdir a directory with characters outside CP_ACP (it will give you invalid file names).
Actually something like that may not be a bad idea. Ian Bicking's webob supports similar double APIs for getting the request parameters out of a request object; I believe request.GET['x'] is a text object and request.GET_str['x'] is the corresponding uninterpreted bytes sequence. I would prefer to have os.environb over os.environ[b"PATH"] though.
And would you keep them synchronized?
I assume at some point we can stop and have sufficiently low-level interfaces that everyone can agree are in bytes only. Bytes aren't going away. How does Java deal with this? Its File class doesn't seem to deal in bytes at all. What would its listFiles() method do with undecodable filenames?
Apparently (JDK 1.5.0_16, on Linux), it decodes undecodable bytes/byte sequences as U+FFFD (REPLACEMENT CHARACTER). Opening such a file will fail with FileNotFoundException. IOW, Java hasn't solved the problem in the last 10 years. Marcin Kowalczyk did a more thorough analysis about a year ago in http://mail.python.org/pipermail/python-3000/2007-September/010450.html Regards, Martin
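Why replacing bytes with U+FFFD makes such a file unopenable can be shown in a few lines of Python. This is a sketch of the general effect, not of Java's actual implementation, and the filename is hypothetical.

```python
# Hypothetical Latin-1 encoded name; the byte 0xe9 is invalid as UTF-8 here.
raw_name = b"caf\xe9.txt"

# Replace the undecodable byte with U+FFFD (REPLACEMENT CHARACTER).
shown_name = raw_name.decode("utf-8", "replace")

# Encoding the displayed name does not give back the bytes stored on disk,
# so a lookup using the re-encoded name refers to a different file name.
reencoded = shown_name.encode("utf-8")
```

The information needed to recover the original byte is lost at decode time, which is why a later open by that name cannot succeed.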
On Tue, Sep 30, 2008 at 3:21 PM, "Martin v. Löwis" <martin@v.loewis.de> wrote:
My concern still is that it brings the bytes type into the status of another character string type, which is really bad, and will require further modifications to Python for the lifetime of 3.x.
I'd like to understand why this is "really bad". I thought it was by design that the str and bytes types behave pretty similarly. You can use both as dict keys.
If they have to behave pretty similarly, they have to be supported in all APIs that deal with text.
I don't see how you get from "pretty similarly" to "all APIs". :-)
For example, people will demand that printing bytes should just copy them onto the stream (rather than invoking repr()), and writing them onto a text stream should work the same way. GUI libraries should support them, as should the XML libraries, and so on.
Where will you stop, and tell people that bytes are just not supposed to do this or that?
Printing a bytes object already works, and displays its repr(), which is guaranteed to be pure ASCII (unlike the repr() of a unicode str object in Py3k). All the others you mention will cause breakage as they should -- these errors exist to force the programmer to think about encodings or conversions. I don't see that as a big burden because the only way there could be bytes here in the first place is when the user explicitly requested bytes. A program that only ever passes text strings to the os module is only ever going to get text strings back.
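A short sketch of both points, using a made-up filename: the repr() that print() shows for a bytes object is pure ASCII, and mixing bytes with str raises an error until the programmer converts explicitly.

```python
name = b"t\xc3\xaaste"  # UTF-8 bytes for a non-ASCII filename

# print(name) displays this repr(), which contains only ASCII characters.
printed = repr(name)

# Mixing bytes and str raises instead of silently guessing an encoding.
try:
    path = "dir/" + name
    mixed_ok = True
except TypeError:
    mixed_ok = False

# An explicit decode is the conversion step the error is meant to force.
path = "dir/" + name.decode("utf-8")
```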
This is because applications will then regularly use byte strings for file names on Unix, and regular strings on Windows, and then expect the program to work the same without further modifications.
It seems that bytes arguments actually *do* work on Windows -- somehow they get decoded. (Unless Terry's report was from 2.x.)
To a limited degree - see my other message. Don't try to listdir a directory with characters outside CP_ACP (it will give you invalid file names).
Understood.
Actually something like that may not be a bad idea. Ian Bicking's webob supports similar double APIs for getting the request parameters out of a request object; I believe request.GET['x'] is a text object and request.GET_str['x'] is the corresponding uninterpreted bytes sequence. I would prefer to have os.environb over os.environ[b"PATH"] though.
And would you keep them synchronized?
Yes, the bytes versions would be the canonical version and the str version would wrap around that -- though updating the str version would also update the bytes version. Some keys would be missing from the str version (or perhaps they would raise exceptions or default to some other error handler, like ignore or replace).
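The synchronization scheme described here could look roughly like the sketch below. The class name StrEnvironView, the UTF-8 default, and the choice to hide undecodable entries are all assumptions made for illustration; this is not the os module's implementation.

```python
from collections.abc import MutableMapping


class StrEnvironView(MutableMapping):
    """Hypothetical str view over a canonical bytes mapping (not a real os API)."""

    def __init__(self, raw, encoding="utf-8"):
        self._raw = raw            # canonical bytes -> bytes mapping
        self._encoding = encoding  # assumed environment encoding

    def _decodable(self, data):
        try:
            data.decode(self._encoding)
            return True
        except UnicodeDecodeError:
            return False

    def __getitem__(self, key):
        value = self._raw[key.encode(self._encoding)]
        if not self._decodable(value):
            raise KeyError(key)  # undecodable entries are hidden from the str view
        return value.decode(self._encoding)

    def __setitem__(self, key, value):
        # Writes go straight through to the canonical bytes mapping.
        self._raw[key.encode(self._encoding)] = value.encode(self._encoding)

    def __delitem__(self, key):
        del self._raw[key.encode(self._encoding)]

    def __iter__(self):
        for key, value in self._raw.items():
            if self._decodable(key) and self._decodable(value):
                yield key.decode(self._encoding)

    def __len__(self):
        return sum(1 for _ in self)


environb = {b"PATH": b"/usr/bin", b"BAD": b"\xff"}  # stand-in for a bytes environment
environ = StrEnvironView(environb)
environ["HOME"] = "/home/georg"  # updating the str view updates the bytes mapping
```

Raising KeyError for undecodable entries is only one of the options mentioned; an "ignore" or "replace" error handler would be a different design choice.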
I assume at some point we can stop and have sufficiently low-level interfaces that everyone can agree are in bytes only. Bytes aren't going away. How does Java deal with this? Its File class doesn't seem to deal in bytes at all. What would its listFiles() method do with undecodable filenames?
Apparently (JDK 1.5.0_16, on Linux), it decodes undecodable bytes/byte sequences as U+FFFD (REPLACEMENT CHARACTER). Opening such a file will fail with FileNotFoundException.
IOW, Java hasn't solved the problem in the last 10 years. Marcin Kowalczyk did a more thorough analysis about a year ago in
http://mail.python.org/pipermail/python-3000/2007-September/010450.html
I can't say I like the Java solution. I would like to be able to write a robust backup tool in Python, even if the code needed to make it work everywhere isn't going to win any prizes (due to the need to use bytes on Unix, str on Windows). -- --Guido van Rossum (home page: http://www.python.org/~guido/)
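A rough sketch of the kind of platform switch such a backup tool would need. The helper name walk_names is hypothetical, and the assumption that bytes paths are lossless on Unix while str paths are lossless on Windows comes from the discussion above.

```python
import os


def walk_names(root):
    """Yield every path below root, as bytes on POSIX and as str on Windows."""
    # Assumption: bytes paths are lossless on POSIX, str paths on Windows.
    if os.name == "nt":
        top = os.fsdecode(root)
    else:
        top = os.fsencode(root)
    for dirpath, dirnames, filenames in os.walk(top):
        for name in dirnames + filenames:
            yield os.path.join(dirpath, name)
```

Everything downstream of this helper has to accept either type, which is the part that "isn't going to win any prizes".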
On Sep 30, 2008, at 6:21 PM, Martin v. Löwis wrote:
IOW, Java hasn't solved the problem in the last 10 years.
Java is already really bad at being a small little language to write cooperating tools in. I'd never even attempt to write a little pipeline filter in Java -- I've already pretty much learned to expect Java applications to be in their own world, so I'd hardly find it surprising if a Java app could only read files it wrote itself, never mind files in odd encodings. Python, on the other hand, is an awesome tool for writing small little scripts that interact well with the surrounding environment, Just The Way It Is, without trying to layer so much abstraction upon it that you lose functionality. Moving away from that would be unfortunate.
James
participants (4)
- "Martin v. Löwis"
- Georg Brandl
- Guido van Rossum
- James Y Knight