Re: [Python-Dev] Patch for an initial support of bytes filename in Python3
On 12:47 am, victor.stinner@haypocalc.com wrote: This is the most sane contribution I've seen so far :).
See attached patch: python3_bytes_filename.patch
Using the patch, you will get:
- open() supports bytes
- listdir(unicode) -> only unicode, *skip* invalid filenames (as asked by Guido)
Forgive me for being a bit dense, but I couldn't find this hunk in the patch. Do I understand properly that (listdir(bytes) -> bytes)? If so, this seems basically sane to me, since it provides text behavior where possible and allows more sophisticated filesystem wrappers (i.e. Twisted's FilePath, Will McGugan's "FS") to do more tricky things, separating filenames for display to the user and filenames for exchange with the FS.
- remove os.getcwdu()
- create os.getcwdb() -> bytes
- glob.glob() supports bytes
- fnmatch.filter() supports bytes
- posixpath.join() and posixpath.split() support bytes
It sounds like maybe there should be some 2to3 fixers in here somewhere, too? Not necessarily as part of this patch, but somewhere related? I don't know what they would do, but it does seem quite likely that code which was previously correct under 2.6 (using bytes) would suddenly be mixing bytes and unicode with these APIs.
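For readers who want to see what "support bytes" means for the path helpers listed above, here is a minimal sketch in present-day Python 3. The file names are made up for illustration, and the behaviour shown is that of current releases, not necessarily of the patch as attached.

```python
import fnmatch
import posixpath

# Hypothetical directory listing; the last name is not valid UTF-8.
names = [b"a.mp3", b"b.ogg", b"caf\xe9.mp3"]

# A bytes pattern filters bytes names without decoding them.
mp3s = fnmatch.filter(names, b"*.mp3")

# join() and split() stay in bytes as long as every component is bytes.
joined = posixpath.join(b"/music", mp3s[-1])
head, tail = posixpath.split(joined)
```

Mixing str and bytes components in one call raises TypeError in current Python 3, which is the kind of breakage the 2to3 question is about.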
Hi,
This is the most sane contribution I've seen so far :).
Oh thanks.
Do I understand properly that (listdir(bytes) -> bytes)?
Yes, os.listdir(bytes)->bytes. It's already the current behaviour. But with Python3 trunk, os.listdir(str) -> str ... or bytes (if unicode conversion fails).
If so, this seems basically sane to me, since it provides text behavior where possible and allows more sophisticated filesystem wrappers (i.e. Twisted's FilePath, Will McGugan's "FS") to do more tricky things, separating filenames for display to the user and filenames for exchange with the FS.
It's the goal of my patch. Let people do what they want with bytes: rename the file, try the best charset to display the filename, etc.
- remove os.getcwdu()
- create os.getcwdb() -> bytes
- glob.glob() supports bytes
- fnmatch.filter() supports bytes
- posixpath.join() and posixpath.split() support bytes
It sounds like maybe there should be some 2to3 fixers in here somewhere, too?
IMHO a programmer should not use bytes for filenames. Only specific programs used to fix a broken system (e.g. the convmv program), a backup program, etc. should use bytes. So the "default" type (type and not charset) for filenames should be str in Python3. If my patch is applied, 2to3 only has to replace getcwdu() with getcwd(). That's all.
Not necessarily as part of this patch, but somewhere related? I don't know what they would do, but it does seem quite likely that code which was previously correct under 2.6 (using bytes) would suddenly be mixing bytes and unicode with these APIs.
It looks like 2to3 converts all text '...' or u'...' to unicode (str). So converted programs will use str for filenames.
--
Victor Stinner aka haypo
http://www.haypocalc.com/blog/
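The "type in, same type out" rule discussed in this message can be sketched as follows. This assumes a current Python 3, where os.fsencode() exists (it did not at the time of this thread); the temporary directory and the file name demo.txt are made up for the example.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Create one file with a plain ASCII name.
    with open(os.path.join(tmp, "demo.txt"), "w"):
        pass
    text_names = os.listdir(tmp)                # str argument -> str names
    byte_names = os.listdir(os.fsencode(tmp))   # bytes argument -> bytes names

cwd_text = os.getcwd()    # str
cwd_bytes = os.getcwdb()  # bytes
```

How undecodable names are treated when the argument is str has changed since this thread, so the sketch sticks to a name that decodes cleanly.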
On Tue, Sep 30, 2008 at 6:21 AM, <glyph@divmod.com> wrote:
On 12:47 am, victor.stinner@haypocalc.com wrote:
This is the most sane contribution I've seen so far :).
Thanks. I'll review it later today (after coffee+breakfast :) and will apply it assuming the code is reasonably sane, otherwise I'll go around with Victor until it is to my satisfaction.
See attached patch: python3_bytes_filename.patch
Using the patch, you will get:
- open() supports bytes
- listdir(unicode) -> only unicode, *skip* invalid filenames (as asked by Guido)
Forgive me for being a bit dense, but I couldn't find this hunk in the patch. Do I understand properly that (listdir(bytes) -> bytes)?
If so, this seems basically sane to me, since it provides text behavior where possible and allows more sophisticated filesystem wrappers (i.e. Twisted's FilePath, Will McGugan's "FS") to do more tricky things, separating filenames for display to the user and filenames for exchange with the FS.
- remove os.getcwdu()
- create os.getcwdb() -> bytes
- glob.glob() supports bytes
- fnmatch.filter() supports bytes
- posixpath.join() and posixpath.split() support bytes
It sounds like maybe there should be some 2to3 fixers in here somewhere, too? Not necessarily as part of this patch, but somewhere related? I don't know what they would do, but it does seem quite likely that code which was previously correct under 2.6 (using bytes) would suddenly be mixing bytes and unicode with these APIs.
Doesn't seem easy for 2to3 to recognize such cases. If 2.6 weren't pretty much released already I'd ask to add os.getcwdb() there, as an alias for os.getcwd(), and add a 2to3 fixer that converts os.getcwdu() to os.getcwd(), leaves os.getcwd() alone (benefit of the doubt) and leaves os.getcwdb() alone as well (a strong indication the user meant to get bytes in the 3.x version of their code). (Similar to using bytes instead of str in 2.6 even though they mean the same thing there -- they will be properly separated in 3.x.)
--
--Guido van Rossum (home page: http://www.python.org/~guido/)
On Tue, Sep 30, 2008 at 10:59 AM, <glyph@divmod.com> wrote:
On 02:32 pm, guido@python.org wrote:
If 2.6 weren't pretty much released already I'd ask to add os.getcwdb() there, as an alias for os.getcwd(), and add a 2to3 fixer that converts os.getcwdu() to os.getcwd(), leaves os.getcwd() alone (benefit of the doubt) and leaves os.getcwdb() alone as well (a strong indication the user meant to get bytes in the 3.x version of their code). (Similar to using bytes instead of str in 2.6 even though they mean the same thing there -- they will be properly separated in 3.x.)
In the absence of a 2.6 getcwdb, perhaps the fixer could just drop the "benefit of the doubt" case? It could always be added to 2.7, and the parity release of 2to3 could have a --2.7 switch that would modify the behavior of this and other fixers.
I'm not sure what you're proposing. *My* proposal is that 2to3 changes os.getcwdu() calls to os.getcwd() and leaves os.getcwd() calls alone -- there's no way to tell whether os.getcwdb() would be a better match, and for portable code, it won't be (since os.getcwdb() is a Unix-only thing).
--
--Guido van Rossum (home page: http://www.python.org/~guido/)
On 02:32 pm, guido@python.org wrote:
On Tue, Sep 30, 2008 at 6:21 AM, <glyph@divmod.com> wrote:
On 12:47 am, victor.stinner@haypocalc.com wrote:
It sounds like maybe there should be some 2to3 fixers in here somewhere, too? Not necessarily as part of this patch, but somewhere related? I don't know what they would do, but it does seem quite likely that code which was previously correct under 2.6 (using bytes) would suddenly be mixing bytes and unicode with these APIs.
Doesn't seem easy for 2to3 to recognize such cases.
Actually I think I'm wrong. As far as dealing with glob(), listdir() and friends, I suppose that other bytes/text fixers will already have had their opportunity to deal with getting the type to be the appropriate thing, and if you have glob(<something that 2to3 understands should be bytes>) it will work as expected in 3.0. (I am really just confirming that I have nothing useful to say here, using too many words to do it: at least, I hope that nobody will waste further time thinking about it as a result.)
If 2.6 weren't pretty much released already I'd ask to add os.getcwdb() there, as an alias for os.getcwd(), and add a 2to3 fixer that converts os.getcwdu() to os.getcwd(), leaves os.getcwd() alone (benefit of the doubt) and leaves os.getcwdb() alone as well (a strong indication the user meant to get bytes in the 3.x version of their code). (Similar to using bytes instead of str in 2.6 even though they mean the same thing there -- they will be properly separated in 3.x.)
In the absence of a 2.6 getcwdb, perhaps the fixer could just drop the "benefit of the doubt" case? It could always be added to 2.7, and the parity release of 2to3 could have a --2.7 switch that would modify the behavior of this and other fixers.
On 05:56 pm, guido@python.org wrote:
On Tue, Sep 30, 2008 at 10:59 AM, <glyph@divmod.com> wrote:
On 02:32 pm, guido@python.org wrote:
In the absence of a 2.6 getcwdb, perhaps the fixer could just drop the "benefit of the doubt" case? It could always be added to 2.7, and the parity release of 2to3 could have a --2.7 switch that would modify the behavior of this and other fixers.
I'm not sure what you're proposing. *My* proposal is that 2to3 changes os.getcwdu() calls to os.getcwd() and leaves os.getcwd() calls alone -- there's no way to tell whether os.getcwdb() would be a better match, and for portable code, it won't be (since os.getcwdb() is a Unix-only thing).
My proposal is simply to change getcwd to getcwdb, and getcwdu to getcwd. This preserves whatever bytes/text behavior you are expecting from 2.6 into 3.0. Granted, the fact that unicode is really always the right thing to do on Windows complicates things. I already tend to avoid os.getcwd() though, and this is just one more reason to avoid it. In the rare cases where I really do need it, it looks like os.path.abspath(b".") / os.path.abspath(u".") will provide the clarity that I want.
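To make the difference between the two proposals concrete, here is a toy sketch of both renamings as a regex rewrite. It is not the actual 2to3 fixer; the mapping tables and the rewrite() helper are hypothetical names invented for this illustration.

```python
import re

# Hypothetical rename tables, one per proposal in this thread.
GUIDO_MAPPING = {"getcwdu": "getcwd"}                       # getcwd() and getcwdb() left alone
GLYPH_MAPPING = {"getcwdu": "getcwd", "getcwd": "getcwdb"}  # preserve the 2.x bytes/text split

# Match os.getcwd, os.getcwdu or os.getcwdb as a whole name.
_CALL = re.compile(r"\bos\.(getcwd[ub]?)\b")

def rewrite(source, mapping):
    """Rename os.getcwd*() references in a single pass so renames cannot chain."""
    return _CALL.sub(lambda m: "os." + mapping.get(m.group(1), m.group(1)), source)

py2_source = "a = os.getcwd()\nb = os.getcwdu()\n"
guido_result = rewrite(py2_source, GUIDO_MAPPING)
glyph_result = rewrite(py2_source, GLYPH_MAPPING)
```

Under the first table both calls end up as os.getcwd(); under the second, the old bytes-returning call becomes os.getcwdb(). A real fixer would work on the parse tree instead of on raw text.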
On Tue, Sep 30, 2008 at 7:56 PM, Guido van Rossum <guido@python.org> wrote:
(since os.getcwdb() is a Unix-only thing).
I would be happier if all the Unix byte functions existed on Windows and fell back to something like encoding the filenames to/from UTF-8. Then at least it would be possible for programs to support reading all files on both Unix and Windows without having to perform some sort of explicit check to see whether os.getcwdb() and friends are supported.
On Tue, Sep 30, 2008 at 11:47 AM, <glyph@divmod.com> wrote:
On 05:56 pm, guido@python.org wrote:
On Tue, Sep 30, 2008 at 10:59 AM, <glyph@divmod.com> wrote:
On 02:32 pm, guido@python.org wrote:
In the absence of a 2.6 getcwdb, perhaps the fixer could just drop the "benefit of the doubt" case? It could always be added to 2.7, and the parity release of 2to3 could have a --2.7 switch that would modify the behavior of this and other fixers.
I'm not sure what you're proposing. *My* proposal is that 2to3 changes os.getcwdu() calls to os.getcwd() and leaves os.getcwd() calls alone -- there's no way to tell whether os.getcwdb() would be a better match, and for portable code, it won't be (since os.getcwdb() is a Unix-only thing).
My proposal is simply to change getcwd to getcwdb, and getcwdu to getcwd. This preserves whatever bytes/text behavior you are expecting from 2.6 into 3.0. Granted, the fact that unicode is really always the right thing to do on Windows complicates things.
Plus, even on Linux Unicode is *usually* what you should be doing, unless you're writing a backup tool.
I already tend to avoid os.getcwd() though, and this is just one more reason to avoid it. In the rare cases where I really do need it, it looks like os.path.abspath(b".") / os.path.abspath(u".") will provide the clarity that I want.
Or os.path.expanduser('~') vs. os.path.expanduser(b'~'). :-)
--
--Guido van Rossum (home page: http://www.python.org/~guido/)
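A short sketch of the idiom mentioned here, as it behaves in current Python 3: the type of the argument selects the type of the result, so the caller states the intent at the call site.

```python
import os.path

# The argument type decides the result type.
text_cwd = os.path.abspath(".")        # str
bytes_cwd = os.path.abspath(b".")      # bytes

text_home = os.path.expanduser("~")    # str
bytes_home = os.path.expanduser(b"~")  # bytes
```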
On Tue, Sep 30, 2008 at 12:07 PM, Simon Cross <hodgestar+pythondev@gmail.com> wrote:
On Tue, Sep 30, 2008 at 7:56 PM, Guido van Rossum <guido@python.org> wrote:
(since os.getcwdb() is a Unix-only thing).
I would be happier if all the Unix byte functions existed on Windows and fell back to something like encoding the filenames to/from UTF-8. Then at least it would be possible for programs to support reading all files on both Unix and Windows without having to perform some sort of explicit check to see whether os.getcwdb() and friends are supported.
Actually on Windows the syscalls use the encoding that Microsoft uses -- when using bytes we use the Windows bytes API and when using str we use the Windows wide API. That's the most platform-compatible approach.
--
--Guido van Rossum (home page: http://www.python.org/~guido/)
On Wed, Oct 1, 2008 at 12:05 AM, Guido van Rossum <guido@python.org> wrote:
Actually on Windows the syscalls use the encoding that Microsoft uses -- when using bytes we use the Windows bytes API and when using str we use the Windows wide API. That's the most platform-compatible approach.
Woot. As long as the Python file API is consistent across the two platforms, I'm happy. :)
On Wed, Oct 1, 2008 at 12:04 AM, Guido van Rossum <guido@python.org> wrote:
Plus, even on Linux Unicode is *usually* what you should be doing, unless you're writing a backup tool.
I still find this line of reasoning a bit worrying. Imagine an end user application like a music player. The user discovers that he can't see some .mp3 or .ogg file from the music player that is visible in the file manager. I would expect him to file a bug on the music player. If the bug was closed with "fix the filename" I imagine the user would respond with "but other programs can access it just fine". I'm not unhappy with the solution Victor is proposing, but I imagine that when I start coding projects in 3.0 I'll default to the bytes versions of the filename methods and use b"path".decode(sys.getfilesystemencoding(), "replace") if I need to get Unicode.
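The approach described in the last sentence might look like the sketch below: keep the raw bytes name for talking to the filesystem and derive a separate str only for display. The helper list_for_display() and the file name song.mp3 are hypothetical, and os.fsencode() is a later addition to Python 3 used here for convenience.

```python
import os
import sys
import tempfile

def list_for_display(directory):
    """Return (raw_name, display_name) pairs for a bytes directory path."""
    encoding = sys.getfilesystemencoding()
    # "replace" substitutes U+FFFD for undecodable bytes instead of raising.
    return [(raw, raw.decode(encoding, "replace")) for raw in os.listdir(directory)]

with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "song.mp3"), "w"):
        pass
    pairs = list_for_display(os.fsencode(tmp))
```

The display string is lossy by design, so it should not be encoded back to reach the file; the raw name in the pair is what gets passed to open().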
Simon Cross writes:
I still find this line of reasoning a bit worrying. Imagine an end user application like a music player. The user discovers that he can't see some .mp3 or .ogg file from the music player that is visible in the file manager. I would expect him to file a bug on the music player. If the bug was closed with "fix the filename" I imagine the user would respond with "but other programs can access it just fine".
And the user would very likely be *wrong*. The file manager is displaying it, but in the nature of things file managers *don't access files*, they access *directories*. The files they pass to other apps to access. That's precisely the kind of situation that Georg Brandl was describing with OpenOffice.
I'm not unhappy with the solution Victor is proposing, but I imagine that when I start coding projects in 3.0 I'll default to the bytes versions of the filename methods and use b"path".decode(sys.getfilesystemencoding(), "replace") if I need to get Unicode.
But now the user will file a bug because in the file opening dialog they can't *read* their Chinese file names on their USB key because they are appearing in (system encoding) Cyrillic. Do you begin to see the nature of the Catch-22 here? I don't expect the user to be very sympathetic when you tell her to fix the filenames, but it's not as easy as you would think to get this right.
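The failure mode described here can be reproduced in a few lines. This is only an illustration under assumed encodings: the names on the USB key are taken to be UTF-8 and the system encoding KOI8-R, both chosen for the example and not stated in the thread.

```python
name = "中文.txt"            # what the user typed on the machine that wrote the key
raw = name.encode("utf-8")   # bytes actually stored in the directory entry

# Decoding with the (assumed) Cyrillic system encoding never fails, because
# KOI8-R assigns a character to every byte value, but the result is unreadable.
shown = raw.decode("koi8_r")
```

No error is raised, so the program has no signal that the displayed name is wrong; only the ASCII suffix survives.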
On Wed, Oct 1, 2008 at 12:25 PM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
Simon Cross writes:
I still find this line of reasoning a bit worrying. Imagine an end user application like a music player. The user discovers that he can't see some .mp3 or .ogg file from the music player that is visible in the file manager. I would expect him to file a bug on the music player. If the bug was closed with "fix the filename" I imagine the user would respond with "but other programs can access it just fine".
And the user would very likely be *wrong*. The file manager is displaying it, but in the nature of things file managers *don't access files*, they access *directories*. The files they pass to other apps to access.
Exactly the same reasoning applies to files in a directory with an odd name.
I'm not unhappy with the solution Victor is proposing, but I imagine that when I start coding projects in 3.0 I'll default to the bytes versions of the filename methods and use b"path".decode(sys.getfilesystemencoding(), "replace") if I need to get Unicode.
But now the user will file a bug because in the file opening dialog they can't *read* their Chinese file names on their USB key because they are appearing in (system encoding) Cyrillic. Do you begin to see the nature of the Catch-22 here?
I don't expect the user to be very sympathetic when you tell her to fix the filenames, but it's not as easy as you would think to get this right.
a) There is some chance that at least ASCII characters will be displayed correctly if getfilesystemencoding() is similar to the encoding used, and corrupted filenames will display correctly except for corrupted characters.
b) The user will at least be able to access the file. It's a more graceful degradation of functionality than not being able to work with the file at all.

Schiavo
Simon
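Point (a) can be illustrated with a single decode call. The byte string below is a made-up Latin-1 file name, and UTF-8 stands in for the filesystem encoding.

```python
raw = b"caf\xe9.mp3"   # Latin-1 bytes for a name with one accented letter; not valid UTF-8

# Only the undecodable byte is replaced; the ASCII characters come through intact.
shown = raw.decode("utf-8", "replace")
# shown == "caf\ufffd.mp3"
```

This works because the assumed encoding agrees with the real one on the ASCII range; with an unrelated multi-byte encoding more of the name could be lost.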
Simon Cross writes:
a) There is some chance that at least ASCII characters will be displayed correctly if getfilesystemencoding() is similar to the encoding used and corrupted filenames will display correctly except for corrupted characters.
All you're saying is that the cases *you* can imagine running into work better. All I'm saying is the opposite. We're both right; the point is that that means that Python can't be, not all of the time. We know from experience (Emacs/Mule, Java) that trying to impose a theoretical system on encoding just doesn't work by itself[1], and in fact creates other problems by its very rigidity. I'd like to see Python not fall into that trap, too.

Footnotes:
[1] It needs system-level support as in Windows and Mac OS X.
On Wed, Oct 1, 2008 at 1:05 AM, Simon Cross <hodgestar+pythondev@gmail.com> wrote:
On Wed, Oct 1, 2008 at 12:04 AM, Guido van Rossum <guido@python.org> wrote:
Plus, even on Linux Unicode is *usually* what you should be doing, unless you're writing a backup tool.
I still find this line of reasoning a bit worrying. Imagine an end user application like a music player. The user discovers that he can't see some .mp3 or .ogg file from the music player that is visible in the file manager. I would expect him to file a bug on the music player. If the bug was closed with "fix the filename" I imagine the user would respond with "but other programs can access it just fine".
I see nothing wrong with this scenario. If undecodable filenames are a common thing then the authors of the music player should be using the bytes variant of the API, and if they get enough bugs like this they will fix their code to do so. OTOH if this is not common the response "rename the file" is totally reasonable -- you have to prioritize your bugs or else you'll never get any software released, and the occasional work-around is a given.
--
--Guido van Rossum (home page: http://www.python.org/~guido/)
participants (5)
- glyph@divmod.com
- Guido van Rossum
- Simon Cross
- Stephen J. Turnbull
- Victor Stinner