Re: [Python-Dev] [Python-3000] New proposition for Python3 bytes filename issue
On Windows, we might reject bytes filenames for all file operations: open(), unlink(), os.path.join(), etc. (raise a TypeError or UnicodeError)
Since I've seen no objections to this yet: please no. If we offer a "lower-level" bytes filename API, it should work for all platforms.
Unfortunately, it can't. You cannot represent all possible file names in a byte string in Windows (just as you can't do so in a Unicode string on Unix). So using byte strings on Windows would work for some files, but fail for others. In particular, listdir might give you a list of file names which you then can't open/stat/recurse into. (of course, you could use UTF-8 as the file system encoding on Windows, but then you will have to rewrite a lot of C code first) Regards, Martin
On Sep 30, 2008, at 5:40 PM, Martin v. Löwis wrote:
On Windows, we might reject bytes filenames for all file operations: open(), unlink(), os.path.join(), etc. (raise a TypeError or UnicodeError)
Since I've seen no objections to this yet: please no. If we offer a "lower-level" bytes filename API, it should work for all platforms.
Unfortunately, it can't. You cannot represent all possible file names in a byte string in Windows (just as you can't do so in a Unicode string on Unix).
As you mention in the parenthetical below, of course it can.
So using byte strings on Windows would work for some files, but fail for others. In particular, listdir might give you a list of file names which you then can't open/stat/recurse into.
(of course, you could use UTF-8 as the file system encoding on Windows, but then you will have to rewrite a lot of C code first)
Yes! If there is a byte-string access method for Windows, pretty please make it decode from UTF-8 internally and call the Unicode version of the Windows APIs. The non-unicode windows APIs are pretty much just broken -- Ideally, Python should never be calling those. But, I still don't like the idea of propagating the "sometimes a string, sometimes bytes" APIs...One or the other, please. Either always strings (if and only if a method for assuring decoding always succeeds), or always bytes. James
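A minimal sketch of the suggestion above, assuming byte-string paths are always UTF-8: decode once at the boundary and pass only str to the underlying call, which on Windows goes through the wide-character APIs. The helper names `to_unicode_path` and `open_path` are hypothetical and are not part of any Python API; this is not how CPython was implemented at the time.

```python
import os
import tempfile


def to_unicode_path(path, encoding="utf-8"):
    """Hypothetical helper: turn a bytes path into str, leave str untouched."""
    if isinstance(path, bytes):
        # Strict decoding: invalid UTF-8 raises UnicodeDecodeError rather
        # than silently mapping to a different file name.
        return path.decode(encoding)
    return path


def open_path(path, mode="r"):
    """Hypothetical wrapper: accept bytes or str, always call open() with str."""
    return open(to_unicode_path(path), mode)


def demo():
    # Write a file via a str path, then read it back via the bytes path.
    with tempfile.TemporaryDirectory() as tmp:
        name = os.path.join(tmp, "example.txt")
        with open(name, "w") as f:
            f.write("hello")
        with open_path(name.encode("utf-8")) as f:
            return f.read()
```

The design choice here is that bytes never reach the operating system, so the result does not depend on the active code page.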
Yes! If there is a byte-string access method for Windows, pretty please make it decode from UTF-8 internally and call the Unicode version of the Windows APIs. The non-unicode windows APIs are pretty much just broken -- Ideally, Python should never be calling those.
I don't think we will manage to release Python 3.0 this year if that change is to be implemented. And then, I don't think the release manager will agree to such a delay. I disagree that the ANSI APIs are broken. For most users (and by that, I mean much more than 99% of the world population with access to Windows computers), they work just fine. You have to deliberately try to break them, or work in an environment where you speak multiple languages (with conflicting scripts) simultaneously. Practicality beats purity, and I applaud Microsoft for such a foresighted design (they are guilty of bad designs in other places, but this one really gives a good tradeoff of all issues, all things considered). Regards, Martin
On Wed, 1 Oct 2008 07:40:01 am Martin v. Löwis wrote:
On Windows, we might reject bytes filenames for all file operations: open(), unlink(), os.path.join(), etc. (raise a TypeError or UnicodeError)
Since I've seen no objections to this yet: please no. If we offer a "lower-level" bytes filename API, it should work for all platforms.
Unfortunately, it can't. You cannot represent all possible file names in a byte string in Windows (just as you can't do so in a Unicode string on Unix).
Sorry, maybe I'm just being thick here, but I don't understand how that is possible. On the physical disk, each Windows file name must be represented by a byte string, yes? So how is it possible that there are Windows files with names that can't be represented as a byte string? What have I missed? -- Steven
On Tue, Sep 30, 2008 at 4:08 PM, Steven D'Aprano <steve@pearwood.info> wrote:
On Wed, 1 Oct 2008 07:40:01 am Martin v. Löwis wrote:
On Windows, we might reject bytes filenames for all file operations: open(), unlink(), os.path.join(), etc. (raise a TypeError or UnicodeError)
Since I've seen no objections to this yet: please no. If we offer a "lower-level" bytes filename API, it should work for all platforms.
Unfortunately, it can't. You cannot represent all possible file names in a byte string in Windows (just as you can't do so in a Unicode string on Unix).
Sorry, maybe I'm just being thick here, but I don't understand how that is possible. On the physical disk, each Windows file name must be represented by a byte string, yes? So how is it possible that there are Windows files with names that can't be represented as a byte string? What have I missed?
I believe on disk it uses UTF-16. -- --Guido van Rossum (home page: http://www.python.org/~guido/)
On Wed, 1 Oct 2008 09:21:37 am Guido van Rossum wrote:
On Tue, Sep 30, 2008 at 4:08 PM, Steven D'Aprano <steve@pearwood.info> wrote:
On Wed, 1 Oct 2008 07:40:01 am Martin v. Löwis wrote:
On Windows, we might reject bytes filenames for all file operations: open(), unlink(), os.path.join(), etc. (raise a TypeError or UnicodeError)
Since I've seen no objections to this yet: please no. If we offer a "lower-level" bytes filename API, it should work for all platforms.
Unfortunately, it can't. You cannot represent all possible file names in a byte string in Windows (just as you can't do so in a Unicode string on Unix).
Sorry, maybe I'm just being thick here, but I don't understand how that is possible. On the physical disk, each Windows file name must be represented by a byte string, yes? So how is it possible that there are Windows files with names that can't be represented as a byte string? What have I missed?
I believe on disk it uses UTF-16.
Which is made up of bytes. There may be byte sequences that are illegal UTF-16, but that's not what Martin said. I don't understand how there can be UTF-16 sequences which don't correspond to some sequence of bytes. How would they be represented in memory? Is this to do with the endianness of the UTF-16 sequence? -- Steven
On Tue, Sep 30, 2008 at 7:04 PM, Steven D'Aprano <steve@pearwood.info> wrote:
I believe on disk it uses UTF-16.
Which is made up of bytes. There may be byte sequences that are illegal UTF-16, but that's not what Martin said. I don't understand how there can be UTF-16 sequences which don't correspond to some sequence of bytes. How would they be represented in memory? Is this to do with the endianness of the UTF-16 sequence?
It has to do with the internal mapping between the ANSI and Unicode functions. On NT systems, CreateFileA will map the ANSI bytestring to a Unicode filename via the active code page, and call CreateFileW accordingly. The active code page cannot be set to something as useful as UTF-8, so given any actual code page (1252, 932, etc.) there are Unicode strings that cannot be represented with a bytestring provided to the ANSI function. -- Michael Urman
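The limitation described above can be approximated in Python with the standard codecs for two of the code pages mentioned (1252 and 932). This is only a sketch: the helper name `representable` is hypothetical, and Python's strict encoders are not the same thing as the conversion Windows performs inside the ANSI functions, which may substitute characters instead of failing.

```python
def representable(name, code_page):
    """Return True if `name` can be encoded strictly in `code_page`."""
    try:
        name.encode(code_page)
        return True
    except UnicodeEncodeError:
        return False


# Sample names: plain ASCII, Latin-1 accent, Japanese.
names = ["report.txt", "caf\u00e9.txt", "\u65e5\u672c\u8a9e.txt"]

for name in names:
    # ascii() keeps the output printable regardless of the console encoding.
    status = {cp: representable(name, cp) for cp in ("cp1252", "cp932", "utf-8")}
    print(ascii(name), status)
```

Whatever single code page is active, some names fall outside it, while UTF-8 covers every name in the sample.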
On Wednesday 01 October 2008 00:28:22, Martin v. Löwis wrote:
I don't think we will manage to release Python 3.0 this year if that change is to be implemented. And then, I don't think the release manager will agree to such a delay.
The minimum change is to disallow bytes/str mix:
- os.listdir(unicode)->unicode and ignore invalid files (current behaviour is to return unicode and bytes)
- os.readlink(unicode)->unicode or raise an error (current behaviour is to return unicode or bytes)
- remove os.getcwdu() (use its code, which is better, for getcwd) and fix test_unicode_file.py

The listdir() change (ignore invalid filenames) is important to avoid strange bugs in os.path.*(), glob.*() or when displaying a filename. I can generate a specific patch for these issues. It's just a subset of my last patch.
-- Victor Stinner aka haypo http://www.haypocalc.com/blog/
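The listdir() part of the proposal above could look roughly like the following. This is a sketch only, not Victor's patch: the function name `decodable_names` is hypothetical, and UTF-8 stands in for whatever the file system encoding is.

```python
def decodable_names(raw_names, encoding="utf-8"):
    """Hypothetical sketch: return only str names, skipping undecodable ones."""
    result = []
    for raw in raw_names:
        try:
            result.append(raw.decode(encoding))
        except UnicodeDecodeError:
            # Invalid file name for this encoding: ignored, as proposed,
            # so callers never see a mix of str and bytes.
            continue
    return result


# Raw names as a Unix directory listing might return them.
raw = [b"ok.txt", b"caf\xc3\xa9.txt", b"bad\xff.txt"]
print([ascii(name) for name in decodable_names(raw)])
```

The trade-off is visible in the sample: the result is uniformly str, but the file named `bad\xff.txt` disappears from the listing.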
Sorry, maybe I'm just being thick here, but I don't understand how that is possible. On the physical disk, each Windows file name must be represented by a byte string, yes? So how is it possible that there are Windows files with names that can't be represented as a byte string? What have I missed?
That we are not really free to choose the byte representation when choosing byte strings. Microsoft has defined how char* strings (i.e. byte strings) are to be interpreted when they are used as file names, namely according to the ANSI code page. That code page is not capable of representing all file names. We could, for example, use the same representation as is used on disk. However, a) there is no API to find out what that representation is, and b) it is not null-byte free, a property often desired for file names, and c) because it contains null bytes, it won't be easy to display such file names on stdout, or in a GUI window. Regards, Martin
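Point b) can be checked quickly in Python, assuming (as suggested earlier in the thread, but not confirmed by any API) that the on-disk form is UTF-16; little-endian is a further assumption made for this sketch.

```python
name = "data.txt"

# Assumption: UTF-16 little-endian stands in for the on-disk representation.
utf16 = name.encode("utf-16-le")
utf8 = name.encode("utf-8")

# Every ASCII character becomes two bytes in UTF-16, the second one zero.
has_null_utf16 = b"\x00" in utf16
has_null_utf8 = b"\x00" in utf8

print(utf16)
print(has_null_utf16, has_null_utf8)
```

Any C API that treats the name as a null-terminated char* would stop after the first character of the UTF-16 form.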
participants (6)
- "Martin v. Löwis"
- Guido van Rossum
- James Y Knight
- Michael Urman
- Steven D'Aprano
- Victor Stinner