Re: [Python-Dev] [Python-3000] New proposition for Python3 bytes filename issue
Victor Stinner wrote:
(Thanks Victor for moving this to the list. Having a discussion in the tracker is really painful, I find.)
POSIX OS
--------
The default behaviour should be to use unicode and raise an error if conversion to unicode fails. It should also be possible to use bytes using bytes arguments and optional arguments (for getcwd).
- listdir(unicode) -> unicode and raise an error on invalid filename
I know I keep flipflopping on this one, but the more I think about it the more I believe it is better to drop those names than to raise an exception. Otherwise a "naive" program that happens to use os.listdir() can be rendered completely useless by a single non-UTF-8 filename. Consider the use of os.listdir() by the glob module. If I am globbing for *.py, why should the presence of a file named b'\xff' cause it to fail? Robust programs using os.listdir() should use the bytes->bytes version.
- listdir(bytes) -> bytes
- getcwd() -> unicode
- getcwd(bytes=True) -> bytes
- open(): accept bytes or unicode
os.path.*() should accept operations on bytes filenames, but maybe not on bytes+unicode arguments. os.path.join('directory', b'filename'): raise an error (or use *implicit* conversion to bytes)?
(Yeah, it should be all bytes or all strings.)
On Mon, Sep 29, 2008 at 9:45 AM, Georg Brandl <g.brandl@gmx.net> wrote:
This approach (changing all path-handling functions to accept either bytes or string, but not both) is doomed in my eyes. First, there are lots of them, second, they are not only in os.path but in many modules and also in user code, and third, I see no clean way of implementing them in the specified way. (Just try to do it with os.path.join as an example; I couldn't find the good way to write it, only the bad and the ugly...)
It doesn't have to be supported for all operations -- just enough to be able to access all the system calls and do the most basic pathname manipulations (split and join -- almost everything else can be built out of those).
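The "all bytes or all strings" rule for join can be sketched roughly as follows. This is only an illustration under stated assumptions: `join_same_type` is a hypothetical helper, not the os.path implementation, and it hard-codes the POSIX separator.

```python
def join_same_type(first, *rest):
    """Join path components that are all str or all bytes (hypothetical helper)."""
    if not isinstance(first, (str, bytes)):
        raise TypeError("path components must be str or bytes")
    kind = type(first)
    for part in rest:
        if type(part) is not kind:
            raise TypeError("cannot mix str and bytes path components")
    # Pick the separator in the same type as the components (POSIX assumed).
    sep = "/" if kind is str else b"/"
    result = first
    for part in rest:
        if part.startswith(sep):
            # An absolute component discards everything before it.
            result = part
        elif not result or result.endswith(sep):
            result += part
        else:
            result += sep + part
    return result


text_path = join_same_type("directory", "filename")
bytes_path = join_same_type(b"directory", b"filename")
```

The only type-dependent step is choosing the separator; a mixed call such as join_same_type('directory', b'filename') raises TypeError instead of converting implicitly.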
If I had to choose, I'd still argue for the modified UTF-8 as filesystem encoding (if it were UTF-8 otherwise), despite possible surprises when a such-encoded filename escapes from Python.
I'm having a hard time finding info about UTF-8b. Does anyone have a decent link?
I noticed that OSX has a different approach yet. I believe it insists on valid UTF-8 filenames. It may even require some normalization but I don't know if the kernel enforces this. I tried to create a file named b'\xff' and it came out as %ff. Then "rm %ff" worked. So I think it may be replacing all bad UTF8 sequences with their % encoding.
The "set filesystem encoding to be Latin-1" approach has a certain charm as well, but clearly would be a mistake on OSX, and probably on other systems too (whenever the user doesn't think in Latin-1).
-- --Guido van Rossum (home page: http://www.python.org/~guido/)
On Mon, Sep 29, 2008 at 11:06 AM, Guido van Rossum <guido@python.org> wrote:
On Mon, Sep 29, 2008 at 9:45 AM, Georg Brandl <g.brandl@gmx.net> wrote:
This approach (changing all path-handling functions to accept either bytes or string, but not both) is doomed in my eyes. First, there are lots of them, second, they are not only in os.path but in many modules and also in user code, and third, I see no clean way of implementing them in the specified way. (Just try to do it with os.path.join as an example; I couldn't find the good way to write it, only the bad and the ugly...)
It doesn't have to be supported for all operations -- just enough to be able to access all the system calls and do the most basic pathname manipulations (split and join -- almost everything else can be built out of those).
If I had to choose, I'd still argue for the modified UTF-8 as filesystem encoding (if it were UTF-8 otherwise), despite possible surprises when a such-encoded filename escapes from Python.
I'm having a hard time finding info about UTF-8b. Does anyone have a decent link?
http://mail.nl.linux.org/linux-utf8/2000-07/msg00040.html
Scroll down to item D, near the bottom. It turns malformed bytes into lone (therefore malformed) surrogates.
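For illustration, the bytes-to-lone-surrogates idea can be tried with the "surrogateescape" error handler found in later Python 3 releases, which works along the same lines. A minimal sketch, assuming such an interpreter; the file name is made up for the example.

```python
raw_name = b"report-\xff.txt"  # hypothetical name; 0xFF is never valid in UTF-8

# Strict decoding fails on the stray byte.
try:
    raw_name.decode("utf-8")
    strict_decode_ok = True
except UnicodeDecodeError:
    strict_decode_ok = False

# "surrogateescape" maps each undecodable byte 0xNN to the lone surrogate U+DCNN.
escaped = raw_name.decode("utf-8", "surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
restored = escaped.encode("utf-8", "surrogateescape")

# A strict encoder rejects the lone surrogate; this is the kind of surprise
# that can occur once such a string leaves the file name APIs.
try:
    escaped.encode("utf-8")
    strict_encode_ok = True
except UnicodeEncodeError:
    strict_encode_ok = False
```

The round trip is lossless, but the intermediate string is not well-formed Unicode, so anything that encodes it strictly (printing, writing to a file) can fail.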
I noticed that OSX has a different approach yet. I believe it insists on valid UTF-8 filenames. It may even require some normalization but I don't know if the kernel enforces this. I tried to create a file named b'\xff' and it came out as %ff. Then "rm %ff" worked. So I think it may be replacing all bad UTF8 sequences with their % encoding.
I suspect linux will eventually take this route as well. If ext3 had an option for UTF-8 validation I know I'd want it on. That'd move the error to the program creating bogus file names, rather than those trying to read, display, and manage them.
The "set filesystem encoding to be Latin-1" approach has a certain charm as well, but clearly would be a mistake on OSX, and probably on other systems too (whenever the user doesn't think in Latin-1).
Aye, it's a better hack than UTF-8b, but adding byte functions is even better.
-- Adam Olsen, aka Rhamphoryncus
On Sep 29, 2008, at 6:17 PM, Adam Olsen wrote:
I suspect linux will eventually take this route as well. If ext3 had an option for UTF-8 validation I know I'd want it on. That'd move the error to the program creating bogus file names, rather than those trying to read, display, and manage them.
Of course, even on Mac OS X, or a theoretical UTF-8-enforcing ext3, random byte strings are still possible in your program's argv, in environment variables, and as arguments to subprocesses. So Python still needs to do something...
James
On Monday 29 September 2008 19:06:01, Guido van Rossum wrote:
- listdir(unicode) -> unicode and raise an error on invalid filename
I know I keep flipflopping on this one, but the more I think about it the more I believe it is better to drop those names than to raise an exception. Otherwise a "naive" program that happens to use os.listdir() can be rendered completely useless by a single non-UTF-8 filename. Consider the use of os.listdir() by the glob module. If I am globbing for *.py, why should the presence of a file named b'\xff' cause it to fail?
It would be hard for a newbie programmer to understand why he's unable to find his very important file ("important r?port.doc") using os.listdir(). And yes, if your file system is broken, glob(<unicode>) will fail.
If we choose to support bytes on Linux, a robust and portable program has to use only bytes filenames on Linux to always be able to list and open files.
A full example to list files and display filenames:

    import os
    import os.path
    import sys

    if os.path.supports_unicode_filenames:
        cwd = os.getcwd()
    else:
        cwd = os.getcwdb()
    encoding = sys.getfilesystemencoding()
    for filename in os.listdir(cwd):
        if os.path.supports_unicode_filenames:
            text = str(filename, encoding, "replace")
        else:
            text = filename
        print("=== File {0} ===".format(text))
        for line in open(filename):
            ...

We need an "if" to choose the directory. The second "if" is only needed to display the filename. Using bytes, it would be possible to write better code to detect the real charset (e.g. ISO-8859-1 in a UTF-8 file system) and so display the filename correctly and/or propose to rename the file. Would it be possible using UTF-8b / PUA hacks?
-- Victor Stinner aka haypo http://www.haypocalc.com/blog/
On Mon, Sep 29, 2008 at 4:29 PM, Victor Stinner <victor.stinner@haypocalc.com> wrote:
On Monday 29 September 2008 19:06:01, Guido van Rossum wrote:
- listdir(unicode) -> unicode and raise an error on invalid filename
I know I keep flipflopping on this one, but the more I think about it the more I believe it is better to drop those names than to raise an exception. Otherwise a "naive" program that happens to use os.listdir() can be rendered completely useless by a single non-UTF-8 filename. Consider the use of os.listdir() by the glob module. If I am globbing for *.py, why should the presence of a file named b'\xff' cause it to fail?
It would be hard for a newbie programmer to understand why he's unable to find his very important file ("important r?port.doc") using os.listdir().
*Every* failure in this scenario will be hard to understand for a newbie programmer. We can just document the fact.
And yes, if your file system is broken, glob(<unicode>) will fail.
Why should it?
If we choose to support bytes on Linux, a robust and portable program has to use only bytes filenames on Linux to always be able to list and open files.
Right. But such robustness is only needed to support certain odd cases and we cannot demand that most people bother to write robust code all the time.
A full example to list files and display filenames:
    import os
    import os.path
    import sys

    if os.path.supports_unicode_filenames:
This is backwards -- the Unicode API is always supported, the bytes API only on Linux (and perhaps some other Unixes).
        cwd = os.getcwd()
    else:
        cwd = os.getcwdb()
    encoding = sys.getfilesystemencoding()
    for filename in os.listdir(cwd):
        if os.path.supports_unicode_filenames:
            text = str(filename, encoding, "replace")
        else:
            text = filename
        print("=== File {0} ===".format(text))
        for line in open(filename):
            ...
We need an "if" to choose the directory. The second "if" is only needed to display the filename. Using bytes, it would be possible to write better code to detect the real charset (e.g. ISO-8859-1 in a UTF-8 file system) and so display the filename correctly and/or propose to rename the file. Would it be possible using UTF-8b / PUA hacks?
-- --Guido van Rossum (home page: http://www.python.org/~guido/)
Guido van Rossum writes:
On Mon, Sep 29, 2008 at 4:29 PM, Victor Stinner <victor.stinner@haypocalc.com> wrote:
It would be hard for a newbie programmer to understand why he's unable to find his very important file ("important r?port.doc") using os.listdir().
*Every* failure in this scenario will be hard to understand for a newbie programmer. We can just document the fact.
Guido is absolutely right. The Emacs/Mule people have been trying to solve this kind of problem for 20 years, and the best they've come up with is Martin's strategy: if you need really robust decoding, force ISO 8859/1 (which for historical reasons uses all 256 octets) to get a lossless internal text representation, and decode from that and *track the encoding used* at the application level. The email-sig/Mailman people will testify how hard this is to do well, even when you have a handful of RFCs that specify how it is to be done! On the other hand, this kind of robustness is almost never needed in "general newbie programming", except when you are writing a program to be used to clean up after an undisciplined administration, or some other system disaster. Under normal circumstances the system encoding is well-known and conformance is universal. The best you can do for a general programming system is to heuristically determine a single system encoding and raise an error if the decoding fails.
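The ISO 8859/1 strategy described above can be sketched as follows. This is an illustration only: `to_lossless_text` and `redecode` are hypothetical names, and the candidate list is an assumption about what an application might try.

```python
def to_lossless_text(raw):
    """Decode bytes as ISO 8859-1, which maps every byte to one code point."""
    return raw.decode("iso-8859-1")


def redecode(lossless, candidates=("utf-8",)):
    """Return (text, encoding_used) for a string made by to_lossless_text().

    The original bytes are recovered first; each candidate encoding is then
    tried strictly, and ISO 8859-1 is kept if none of them fits.
    """
    raw = lossless.encode("iso-8859-1")
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    return lossless, "iso-8859-1"


utf8_name = redecode(to_lossless_text(b"caf\xc3\xa9"))
latin1_name = redecode(to_lossless_text(b"caf\xe9"))
```

The application has to carry the returned encoding alongside the text; that bookkeeping is the "track the encoding used" part, and it is where most of the difficulty lies.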
    import os
    import os.path
    import sys

    if os.path.supports_unicode_filenames:
        cwd = os.getcwd()
    else:
        cwd = os.getcwdb()
    encoding = sys.getfilesystemencoding()
    for filename in os.listdir(cwd):
        if os.path.supports_unicode_filenames:
            text = str(filename, encoding, "replace")
        else:
            text = filename
        print("=== File {0} ===".format(text))
        for line in open(filename):
            ...
We need an "if" to choose the directory. The second "if" is only needed to display the filename. Using bytes, it would be possible to write better code to detect the real charset (e.g. ISO-8859-1 in a UTF-8 file system) and so display the filename correctly and/or propose to rename the file. Would it be possible using UTF-8b / PUA hacks?
Not sure what "it" is: to write the code above using the PUA hack:

    for filename in os.listdir(os.getcwd()):
        text = repr(filename)
        print("=== File {0} ===".format(text))
        for line in open(filename):
            ...

If "it" is "display the filename": sure, see above. If "it" is "detect the real charset": sure, why not?
Regards,
Martin
On Mon, Sep 29, 2008 at 5:29 PM, Victor Stinner <victor.stinner@haypocalc.com> wrote:
On Monday 29 September 2008 19:06:01, Guido van Rossum wrote:
- listdir(unicode) -> unicode and raise an error on invalid filename
I know I keep flipflopping on this one, but the more I think about it the more I believe it is better to drop those names than to raise an exception. Otherwise a "naive" program that happens to use os.listdir() can be rendered completely useless by a single non-UTF-8 filename. Consider the use of os.listdir() by the glob module. If I am globbing for *.py, why should the presence of a file named b'\xff' cause it to fail?
It would be hard for a newbie programmer to understand why he's unable to find his very important file ("important r?port.doc") using os.listdir(). And yes, if your file system is broken, glob(<unicode>) will fail.
Imagine a program that lists all files in a dir, as well as their file size. If we return bytes we'll print the name wrong. If we return lossy unicode we'll be unable to get the size of some files. If we return malformed unicode we'll be unable to print at all (and what if this is a GUI app?)
The common use cases need unicode, so the best options for them are to fail outright or skip bad filenames. The uncommon use cases need bytes, and they could do an explicit lossy decode for printing, while still keeping the internal file name as bytes.
Failing outright does have the advantage that the resulting exception should have a half-decent approximation of the bad filename. (Thanks to the recent choices on unicode repr() and having stderr do escapes.)
-- Adam Olsen, aka Rhamphoryncus
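The "uncommon use case" above, keeping bytes internally and decoding lossily only for display, might look like this. A sketch assuming a Python 3 where os.listdir() and os.stat() accept bytes paths; `list_sizes` is a hypothetical helper, and UTF-8 as the display encoding is an assumption.

```python
import os
import tempfile


def list_sizes(directory):
    """Yield (display_name, size) per entry, keeping bytes names internally."""
    raw_dir = os.fsencode(directory)
    for raw_name in sorted(os.listdir(raw_dir)):
        # The bytes name is what reaches the system call, so the size lookup
        # never depends on the name being decodable.
        size = os.stat(os.path.join(raw_dir, raw_name)).st_size
        # Lossy decode for display only; undecodable bytes become U+FFFD.
        display = raw_name.decode("utf-8", "replace")
        yield display, size


# Small demonstration in a scratch directory.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "a.txt"), "wb") as handle:
        handle.write(b"abc")
    listing = list(list_sizes(tmp))
```

The display string is never passed back to the operating system, so a replacement character in it cannot cause a lookup to fail.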
On Monday 29 September 2008 19:06:01, Guido van Rossum wrote:
I know I keep flipflopping on this one, but the more I think about it the more I believe it is better to drop those names than to raise an exception. Otherwise a "naive" program that happens to use os.listdir() can be rendered completely useless by a single non-UTF-8 filename. Consider the use of os.listdir() by the glob module. If I am globbing for *.py, why should the presence of a file named b'\xff' cause it to fail?
To avoid silent skipping, is it possible to drop 'unreadable' names, issue a warning (instead of exception), and continue to completion? "Warning: unreadable filename skipped; see PyWiki/UnreadableFilenames"
On Mon, Sep 29, 2008 at 8:55 PM, Terry Reedy <tjreedy@udel.edu> wrote:
On Monday 29 September 2008 19:06:01, Guido van Rossum wrote:
I know I keep flipflopping on this one, but the more I think about it the more I believe it is better to drop those names than to raise an exception. Otherwise a "naive" program that happens to use os.listdir() can be rendered completely useless by a single non-UTF-8 filename. Consider the use of os.listdir() by the glob module. If I am globbing for *.py, why should the presence of a file named b'\xff' cause it to fail?
To avoid silent skipping, is it possible to drop 'unreadable' names, issue a warning (instead of exception), and continue to completion? "Warning: unreadable filename skipped; see PyWiki/UnreadableFilenames"
That would be annoying as hell in most cases. I consider the dropping of unreadable names similar to the suppression of "hidden" files by various operating systems. -- --Guido van Rossum (home page: http://www.python.org/~guido/)
On Tue, 30 Sep 2008 11:50:10 pm Guido van Rossum wrote:
To avoid silent skipping, is it possible to drop 'unreadable' names, issue a warning (instead of exception), and continue to completion? "Warning: unreadable filename skipped; see PyWiki/UnreadableFilenames"
That would be annoying as hell in most cases.
Doesn't the warning module default to only displaying the warning once per session? I don't see that it would be annoying as hell to be notified once per session that an error has occurred. What I'd find annoying as hell would be something like this:

$ ls . | wc -l
25
$ python
...
>>> import os
>>> len(os.listdir('.'))
24

Give me a nice clear error, or even a warning. Don't let the error pass silently, unless I explicitly silence it.
I consider the dropping of unreadable names similar to the suppression of "hidden" files by various operating systems.
With the exception of '.' and '..', I consider "hidden" files to be a serious design mistake, but at least most operating systems give the user a way to easily see all such hidden files if you ask. (Almost all. Windows has "superhidden" files that remain hidden even when the user asks to see hidden files, all the better to hide malware. But that's a rant for another list.)
-- Steven
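The skip-with-a-warning behaviour under discussion can be sketched with the standard warnings module. This is an illustration only: `decode_names` is a hypothetical helper, and UnicodeWarning is an arbitrary choice of category, not something the thread settled on.

```python
import warnings


def decode_names(raw_names, encoding="utf-8"):
    """Decode bytes file names, skipping undecodable ones with one warning."""
    decoded = []
    skipped = 0
    for raw in raw_names:
        try:
            decoded.append(raw.decode(encoding))
        except UnicodeDecodeError:
            skipped += 1
    if skipped:
        warnings.warn(
            "{0} unreadable filename(s) skipped".format(skipped),
            UnicodeWarning,
            stacklevel=2,
        )
    return decoded


# Record warnings instead of printing them, to inspect what was issued.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    names = decode_names([b"a.py", b"\xff", b"b.py"])
```

Under the "default" filter action a given warning is printed once per location that triggers it rather than once per session, and a caller who prefers a hard failure can install an "error" filter to turn it into an exception.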
On Tue, Sep 30, 2008 at 7:53 AM, Steven D'Aprano <steve@pearwood.info> wrote:
On Tue, 30 Sep 2008 11:50:10 pm Guido van Rossum wrote:
To avoid silent skipping, is it possible to drop 'unreadable' names, issue a warning (instead of exception), and continue to completion? "Warning: unreadable filename skipped; see PyWiki/UnreadableFilenames"
That would be annoying as hell in most cases.
Doesn't the warning module default to only displaying the warning once per session? I don't see that it would be annoying as hell to be notified once per session that an error has occurred.
What I'd find annoying as hell would be something like this:
$ ls . | wc -l
25
$ python
...
>>> import os
>>> len(os.listdir('.'))
24
And yet similar discrepancies happen all the time -- ls suppresses filenames starting with '.', while os.listdir() shows them (except '.' and '..' themselves). The Mac Finder and its Windows equivalent hide lots of files from you. And have you considered mount points (on Unix)? Face it. Filesystems are black boxes. They have roughly specified behavior, but calls into them can fail or seem inconsistent for many reasons -- concurrent changes by other processes, hidden files (Windows), files that exist but can't be opened due to kernel-level locking, etc. It's best not to worry too much about this. Here's another anomaly:
>>> import os
>>> '.snapshot' in os.listdir('.')
False
>>> os.chdir('.snapshot')
>>> os.getcwd()
'/home/guido/bin/.snapshot'
IOW there's a hidden .snapshot directory that os.listdir() doesn't return -- but it exists! This is a standard feature on NetApp filers. (The reason this file is extra hidden is that it gives access to an infinite set of backups that you don't want to be found by find(1), os.walk() and their kin.)
Give me a nice clear error, or even a warning. Don't let the error pass silently, unless I explicitly silence it.
Depends on your use case. We're talking here of a family of APIs where different programs have different needs. I assert that most programs are best served by an API that doesn't give them surprising and irrelevant errors, as long as there's also an API for the few that want to get to the bottom of things (or as close as they can get -- see above '.snapshot' example).
I consider the dropping of unreadable names similar to the suppression of "hidden" files by various operating systems.
With the exception of '.' and '..', I consider "hidden" files to be a serious design mistake, but at least most operating systems give the user a way to easily see all such hidden files if you ask.
(Almost all. Windows has "superhidden" files that remain hidden even when the user asks to see hidden files, all the better to hide malware. But that's a rant for another list.)
Rant all you want, it won't go away. -- --Guido van Rossum (home page: http://www.python.org/~guido/)
Steven D'Aprano wrote:
On Tue, 30 Sep 2008 11:50:10 pm Guido van Rossum wrote:
To avoid silent skipping, is it possible to drop 'unreadable' names, issue a warning (instead of exception), and continue to completion? "Warning: unreadable filename skipped; see PyWiki/UnreadableFilenames"
That would be annoying as hell in most cases.
Doesn't the warning module default to only displaying the warning once per session? I don't see that it would be annoying as hell to be notified once per session that an error has occurred.
What I'd find annoying as hell would be something like this:
$ ls . | wc -l
25
$ python
...
>>> import os
>>> len(os.listdir('.'))
24
Give me a nice clear error, or even a warning. Don't let the error pass silently, unless I explicitly silence it.
Just another data point: I've just looked at Qt, which provides a filesystem API and whose strings are Unicode, and it seems to drop undecodable filenames as well.
Georg
--
Thus spake the Lord: Thou shalt indent with four spaces. No more, no less. Four shall be the number of spaces thou shalt indent, and the number of thy indenting shall be four. Eight shalt thou not indent, nor either indent thou two, excepting that thou then proceed to four. Tabs are right out.
participants (9)
- Martin v. Löwis
- Adam Olsen
- Georg Brandl
- Guido van Rossum
- James Y Knight
- Stephen J. Turnbull
- Steven D'Aprano
- Terry Reedy
- Victor Stinner