Use our strict mbcs codec instead of the Windows ANSI API
Hi,
I propose to raise Unicode errors if a filename cannot be decoded on Windows, instead of creating bogus filenames with question marks. Because this change is incompatible with Python 3.2, even though such filenames are unusable and I consider the problem a (Python?) bug, I would like your opinion on such a change before working on a patch.
--
Windows has worked internally on Unicode strings since Windows 95 (or something like that), but it also provides an "ANSI" API using the ANSI code page and byte strings for backward compatibility. It was already proposed to drop the bytes API completely from our nt (os) module, but that may break Python backward compatibility (and it is difficult to list the Python programs that use the bytes API to access the file system).
The ANSI API uses the MultiByteToWideChar (decode) and WideCharToMultiByte (encode) functions in the default mode (flags=0): MultiByteToWideChar() replaces undecodable bytes with '?' and WideCharToMultiByte() ignores unencodable characters (!!!). This behaviour produces invalid filenames (see for example issue #13247) and *the user is unable to detect codec errors*.
In Python 3.2, I changed the MBCS codec to make it strict: it raises a UnicodeEncodeError if a character cannot be encoded to the ANSI code page (e.g. encoding Ł to cp1252) and a UnicodeDecodeError if a byte sequence cannot be decoded from the ANSI code page (e.g. b'\xff' from cp932).
I propose to reuse our MBCS codec in strict mode (error handler="strict") together with the native Windows (wide character) API, so that encode/decode errors are noticed directly. It should also simplify the source code: 2 versions of a function are replaced by 1 version plus optional code to decode the arguments and/or encode the result.
--
Read also the previous thread:
[Python-Dev] Byte filenames in the posix module on Windows Wed Jun 8 00:23:20 CEST 2011 http://mail.python.org/pipermail/python-dev/2011-June/111831.html
--
FYI I patched the Python MBCS codec again: it now handles the ignore and replace modes correctly (to encode and decode), and it also supports any error handler.
--
We might use PEP 383 to store undecodable bytes as surrogates (U+DC80-U+DCFF). But the situation is the opposite of the situation on UNIX: on Windows, the problem is more on encoding (text->bytes) than on decoding (bytes->text). On UNIX, problems occur when the system is misconfigured (e.g. wrong locale encoding). On Windows, problems occur when your application uses the old (ANSI) API, whereas your filesystem is fully Unicode compliant and you created Unicode filenames with a program using the new (Windows) API.
Only a few programs are fully Unicode compliant. A lot of programs fail if a filename cannot be encoded to the ANSI code page (just 2 examples: Mercurial and Visual Studio).
Victor
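To make the difference between the error handlers concrete, here is a minimal sketch. It assumes the portable cp1252 codec as a stand-in for the ANSI code page, because the real "mbcs" codec only exists on Windows and may differ in details. Byte 0x81 is used for the decoding case because it is unassigned in Python's cp1252 table. The last part shows the PEP 383 idea (the "surrogateescape" error handler) as a round trip.

```python
# Stand-in for the ANSI code page (assumption: "mbcs" itself is Windows-only).
ANSI = "cp1252"

name = "Ł.txt"  # U+0141 has no cp1252 equivalent

# Strict mode: the problem is reported at the point of conversion.
try:
    name.encode(ANSI, "strict")
except UnicodeEncodeError as exc:
    print("strict encode failed:", exc.reason)

# Lossy modes: no error, but the bytes no longer name the original file.
replaced = name.encode(ANSI, "replace")
ignored = name.encode(ANSI, "ignore")
print(replaced, ignored)

# Decoding side: 0x81 is undefined in Python's cp1252 table.
raw = b"caf\x81.txt"
try:
    raw.decode(ANSI, "strict")
except UnicodeDecodeError as exc:
    print("strict decode failed:", exc.reason)

# PEP 383: keep the undecodable byte as a lone surrogate (U+DC81),
# so that encoding with the same handler restores the original bytes.
escaped = raw.decode(ANSI, "surrogateescape")
roundtrip = escaped.encode(ANSI, "surrogateescape")
```

Note that the surrogate trick only helps in the bytes->text direction; it does nothing for a Unicode filename that has no representation in the code page, which is the common case on Windows.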
On Tue, Oct 25, 2011 at 8:57 AM, Victor Stinner <victor.stinner@haypocalc.com> wrote:
The ANSI API uses MultiByteToWideChar (decode) and WideCharToMultiByte (encode) functions in the default mode (flags=0): MultiByteToWideChar() replaces undecodable bytes by '?' and WideCharToMultiByte() ignores unencodable characters (!!!). This behaviour produces invalid filenames (see for example the issue #13247) and *the user is unable to detect codec errors*.
In Python 3.2, I changed the MBCS codec to make it strict: it raises a UnicodeEncodeError if a character cannot be encoded to the ANSI code page (e.g. encode Ł to cp1252) and a UnicodeDecodeError if a character cannot be decoded from the ANSI code page (e.g. b'\xff' from cp932).
I propose to reuse our MBCS codec in strict mode (error handler="strict"), to notice directly encode/decode errors, with the Windows native (wide character) API. It should simplify the source code: replace 2 versions of a function by 1 version + optional code to decode arguments and/or encode the result.
So we'd be taking existing failures that appear at whatever point the corrupted filename is used and replacing them with explicit failures at the point where the offending string is converted to or from encoded bytes? That sounds reasonable to me, and a lot closer to the way Python behaves on POSIX based systems. Regards, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
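The "late failure versus early failure" distinction can be sketched without touching a real filesystem. Everything here is hypothetical: FILES is an in-memory stand-in for a directory, cp1252 stands in for the ANSI code page, and the three helpers are illustration only, not anything in posixmodule.c.

```python
# Hypothetical in-memory "directory": real (Unicode) names mapped to contents.
FILES = {"Ł.txt": "hello"}
ANSI = "cp1252"  # stand-in for the ANSI code page


def open_by_bytes(name_bytes):
    """Simulate a bytes API: decode the name, then look the file up."""
    name = name_bytes.decode(ANSI)
    if name not in FILES:
        raise FileNotFoundError(name)
    return FILES[name]


def list_lossy():
    """Current behaviour: unencodable characters silently become '?'."""
    return [name.encode(ANSI, "replace") for name in FILES]


def list_strict():
    """Proposed behaviour: fail as soon as a name cannot be encoded."""
    return [name.encode(ANSI, "strict") for name in FILES]


# Late failure: listing succeeds, the error only shows up on use.
for entry in list_lossy():
    try:
        open_by_bytes(entry)
    except FileNotFoundError as exc:
        print("late failure when opening", entry, "->", exc)

# Early failure: the conversion itself reports the offending name.
try:
    list_strict()
except UnicodeEncodeError as exc:
    print("early failure:", exc.object, exc.reason)
```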
+1 from me! Mark
Victor Stinner writes:
I propose to raise Unicode errors if a filename cannot be decoded on Windows, instead of creating bogus filenames with question marks.
By "bogus" you mean "sometimes (?) invalid and the OS will refuse to use them, causing a later hard-to-diagnose exception", rather than "not what the user thinks he wants", right? In the "hard errors" case, a hearty +1 (I'm dealing with this in an experimental version of XEmacs and it's a right PITA if the codec doesn't complain). Backward compatibility is important, but here the costs of fixing such bugs outweigh the value of bug-compatibility. In the latter case (doing things behind the user's back rather than actually breaking the program), I'm basically +1 but do worry about backward compatibility.
Le Mardi 25 Octobre 2011 13:20:12 vous avez écrit :
Victor Stinner writes:
I propose to raise Unicode errors if a filename cannot be decoded on Windows, instead of creating bogus filenames with question marks.
By "bogus" you mean "sometimes (?) invalid and the OS will refuse to use them, causing a later hard-to-diagnose exception", rather than "not what the user thinks he wants", right?
If the ("Unicode") filename cannot be encoded to the ANSI code page, which is usually a small charset (e.g. cp1252 contains 256 code points), Windows replaces unencodable characters with question marks. Imagine that the code page is ASCII: the ("Unicode") filename "hého.txt" will be encoded to b"h?ho.txt". You can display this string in a dialog, but you cannot open the file to read its content... If you pass the filename to os.listdir(), it is even worse because "?" is interpreted ("?" means any character, it's a pattern to match a filename). I would like to raise an error in such a situation, because currently the user cannot be notified otherwise. The user may search for "?" in the filename, but Windows also replaces unencodable characters with a *similar glyph* (e.g. "é" replaced by "e").
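A small sketch of why the "?" replacement is dangerous once it is treated as a wildcard. Two assumptions: the code page is ASCII, as in the example above, and Python's fnmatch is used as a rough stand-in for the Windows pattern matching (the real FindFirstFile rules are not reproduced here). The candidate names are made up.

```python
import fnmatch

real_name = "hého.txt"

# With an ASCII "code page", the unencodable character becomes '?'.
lossy = real_name.encode("ascii", "replace")
print(lossy)

# If the lossy name is later used as a pattern, '?' matches any single
# character, so it no longer identifies one file.
pattern = lossy.decode("ascii")
candidates = ["hého.txt", "haho.txt", "hXho.txt", "hello.txt"]
matches = [n for n in candidates if fnmatch.fnmatchcase(n, pattern)]
print(matches)
```

Three of the four made-up names match the pattern, so the bytes name cannot be mapped back to the file it came from.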
In the "hard errors" case, a hearty +1 (I'm dealing with this in an experimental version of XEmacs and it's a right PITA if the codec doesn't complain).
If you use MultiByteToWideChar and WideCharToMultiByte, you can be notified of errors using some flags, but the functions of the ANSI API don't give access to these flags...
Backward compatibility is important, but here the costs of fixing such bugs outweigh the value of bug-compatibility.
I only want to change how unencodable filenames are handled; the bytes API will still be available. If your filesystem has the "8dot3name" feature enabled, it may work even for unencodable filenames (Windows generates names like HEHO~1.TXT). Victor
I propose to raise Unicode errors if a filename cannot be decoded on Windows, instead of creating bogus filenames with question marks.
Can you please elaborate what APIs you are talking about exactly? If it's the byte APIs (i.e. using bytes as file names), then I'm -1 on this proposal. People that explicitly use bytes for file names deserve to get whatever exact platform semantics the platform has to offer. This is true on Unix, and it is also true on Windows. Regards, Martin
Le Mardi 25 Octobre 2011 09:09:56 vous avez écrit :
I propose to raise Unicode errors if a filename cannot be decoded on Windows, instead of creating bogus filenames with question marks.
Can you please elaborate what APIs you are talking about exactly?
Basically, all functions processing filenames, so most functions of posixmodule.c. Some examples:
- os.listdir(): FindFirstFileA, FindNextFileA, FindCloseA
- os.lstat(): CreateFileA
- os.getcwdb(): getcwd()
- os.mkdir(): CreateDirectoryA
- os.chmod(): SetFileAttributesA
- ...
If it's the byte APIs (i.e. using bytes as file names), then I'm -1 on this proposal. People that explicitly use bytes for file names deserve to get whatever exact platform semantics the platform has to offer. This is true on Unix, and it is also true on Windows.
My proposal is a fix for a bug reported by a user: http://bugs.python.org/issue13247 I want to keep the bytes API for backward compatibility, and it will still work for non-ASCII characters, but only for non-ASCII characters encodable to the ANSI code page. In practice, characters not encodable to the ANSI code page are very rare. For example: it's difficult to write such characters directly with the keyboard. I bet that very few people will notice the change. Victor
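Whether a given name is "encodable to the ANSI code page" is easy to test from Python. This is only a sketch: is_encodable is a hypothetical helper, and the code pages are passed in explicitly because the active ANSI code page ("mbcs") can only be queried on Windows.

```python
def is_encodable(name, encoding):
    """Return True if `name` survives a strict encode to `encoding`."""
    try:
        name.encode(encoding, "strict")
    except UnicodeEncodeError:
        return False
    return True


# Non-ASCII but inside the code page: the bytes API keeps working.
print(is_encodable("é.txt", "cp1252"))
print(is_encodable("日本.txt", "cp932"))

# Outside the code page: these are the names that would now raise.
print(is_encodable("Ł.txt", "cp1252"))
print(is_encodable("é.txt", "cp932"))
```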
My proposal is a fix for a bug reported by a user: http://bugs.python.org/issue13247
So your proposal is that abspath(b".") shall raise a UnicodeError in this case? Are you serious???
In practice, characters not encodable to the ANSI code page are very rare. For example: it's difficult to write such characters directly with the keyboard. I bet that very few people will notice the change.
Except people running into the very issues you are trying to resolve. I'm not sure these people are really helped by having their applications crash all of a sudden. Regards, Martin
On 10/25/2011 4:31 AM, Victor Stinner wrote:
Le Mardi 25 Octobre 2011 09:09:56 vous avez écrit :
I propose to raise Unicode errors if a filename cannot be decoded on Windows, instead of creating bogus filenames with question marks.
Can you please elaborate what APIs you are talking about exactly?
Basically, all functions processing filenames, so most functions of posixmodule.c. Some examples:
This seems way too broad. From your previous posts, I presumed that you only propose to change behavior when the user asks for the bytes version of a unicode name that cannot be properly converted to bytes.
- os.listdir():
os.listdir(unicode) works fine and should not be changed. os.listdir(bytes) is what OP of issue wants changed.
FindFirstFileA, FindNextFileA, FindCloseA
These are not Python names. Are they Windows API names?
- os.lstat(): CreateFileA
This does not create a path and should not be changed as far as I can see.
- os.getcwdb():
This you might change.
getcwd()
This should not be, as no bytes are involved.
- os.mkdir(): CreateDirectoryA
- os.chmod(): SetFileAttributesA
Like os.lstat, these only accept a path and should do what they are supposed to do.
If it's the byte APIs (i.e. using bytes as file names), then I'm -1 on this proposal. People that explicitly use bytes for file names deserve to get whatever exact platform semantics the platform has to offer. This is true on Unix, and it is also true on Windows.
My proposal is a fix for a bug reported by a user: http://bugs.python.org/issue13247
I want to keep the bytes API for backward compatibility, and it will still work for non-ASCII characters, but only for non-ASCII characters encodable to the ANSI code page.
In practice, characters not encodable to the ANSI code page are very rare. For example: it's difficult to write such characters directly with the keyboard. I bet that very few people will notice the change.
Actually, Windows makes switching keyboard setups rather easy once you enable the feature. It might be that people who routinely use non-'ansi' characters in file and directory names do not routinely ask for bytes versions thereof. The doc says "All functions accepting path or file names accept both bytes and string objects, and result in an object of the same type, if a path or file name is returned." It does that now, though it says nothing about the encoding assumed for input bytes or used for output bytes. It does not mention raising exceptions, so doing so is a feature-change that would likely break code. Currently, exceptional situations are signalled with "'?' in returned_path" rather than with an exception object. It ('?') is a bad choice of signal though, given the other uses of '?' in paths. -- Terry Jan Reedy
In general I agree with what you write, Terry. One clarification and one comment, though. Terry Reedy writes:
The doc says "All functions accepting path or file names accept both bytes and string objects, and result in an object of the same type, if a path or file name is returned." It does that now, though it says nothing about the encoding assumed for input bytes or used for output bytes.
That's determined by the OS, and figuring that out is the end user's problem.
It does not mention raising exceptions, so doing so is a feature-change that would likely break code. Currently, exceptional situations are signalled with "'?' in returned_path" rather than with an exception object. It ('?') is a bad choice of signal though, given the other uses of '?' in paths.
True, but this isn't really Python's problem. And IIUC Martin's post, it is hardly "exceptional": it isn't Python doing this, it's just standard Windows behavior, which results in pathnames that are perfectly acceptable to Windows APIs, but unreliable in use because they have different semantics in different Windows APIs. If that is true, there are almost surely user programs that depend on this behavior, even though it sucks.[1]
My original "hearty +1" was dependent on my understanding from Victor's post that this substitution could cause later exceptions because the filename is invalid (eg, contains illegal characters causing Windows to signal an error). If that's not true, I think the proper remedy is to add a strong warning to pylint that use of those APIs is supported (eg, for interaction with existing programs that use them) but that they require careful error-checking for robust use.
As a card-carrying Unicode nazi I wouldn't mind tagging the bytes APIs with a DeprecationWarning but I know that proposal is going nowhere so I withdraw it in advance. <wink>
Footnotes:
[1] Note that the original rationale for this was surely "since users will have a very hard time using file names with this character in them, using it as a substitution character internally will make the problem evident and Sufficiently Smart Programs can deal with it."
Le Mardi 25 Octobre 2011 10:31:56 Victor Stinner a écrit :
Basically, all functions processing filenames, so most functions of posixmodule.c. Some examples:
- os.listdir(): FindFirstFileA, FindNextFileA, FindCloseA
- os.lstat(): CreateFileA
- os.getcwdb(): getcwd()
- os.mkdir(): CreateDirectoryA
- os.chmod(): SetFileAttributesA
- ...
This seems way too broad.
I changed my mind about this list: I only want to change how filenames are encoded, not how filenames are decoded. So only os.listdir() & os.getcwdb() should be changed, as I wrote in another email in this thread and in the issue #13247.
- os.getcwdb(): This you might change.
Issue #13247 combines os.getcwdb() and os.listdir(). Read the issue for more information.
It ('?') is a bad choice of signal though, given the other uses of '?' in paths.
If I understood correctly, '?' is a pattern to match any character in FindFirstFile/FindNextFile. Python cannot configure the replacement character, it's hardcoded to "?" (U+003F).
it's just standard Windows behavior, which results in pathnames that are perfectly acceptable to Windows APIs, but unreliable in use because they have different semantics in different Windows APIs.
I think that such filenames cannot be used with any Windows function accessing the filesystem. Extract of the issue: "Such filenames cannot be used, open() fails with OSError(22, "invalid argument: '?'") for example." They are only useful if you want to display the content of a directory, but don't expect to be able to read the file content.
--
Anyway, you must use Unicode on Windows! The bytes API was just kept for backward compatibility. Victor
Le Mardi 25 Octobre 2011 09:09:56 vous avez écrit :
If it's the byte APIs (i.e. using bytes as file names), then I'm -1 on this proposal. People that explicitly use bytes for file names deserve to get whatever exact platform semantics the platform has to offer. This is true on Unix, and it is also true on Windows.
For your information, it took me something like 3 months (when I was working on issue #12281) to understand exactly how Windows handles undecodable bytes and unencodable characters. I did a lot of tests on different Windows versions (XP, Vista and Seven; the behaviour changed in Windows Vista). I had to take notes because it is really complex. Well, I wanted to understand exactly *all* code pages, including CP_UTF7 and CP_UTF8, not only the most common ones like cp1252 or cp932. See the dedicated section in my book to learn more about these functions: http://www.haypocalc.com/tmp/unicode-2011-07-20/html/operating_systems.html#... and-decode-functions Some information is available in the MultiByteToWideChar and WideCharToMultiByte documentation, but it is not well explained :-p Victor
On Tue, 25 Oct 2011 00:57:42 +0200 Victor Stinner <victor.stinner@haypocalc.com> wrote:
Hi,
I propose to raise Unicode errors if a filename cannot be decoded on Windows, instead of creating bogus filenames with question marks. Because this change is incompatible with Python 3.2, even though such filenames are unusable and I consider the problem a (Python?) bug, I would like your opinion on such a change before working on a patch.
+1 from me. Regards Antoine.
Le mardi 25 octobre 2011 00:57:42, Victor Stinner a écrit :
I propose to raise Unicode errors if a filename cannot be decoded on Windows, instead of creating bogus filenames with question marks. Because this change is incompatible with Python 3.2, even though such filenames are unusable and I consider the problem a (Python?) bug, I would like your opinion on such a change before working on a patch.
Most people like the idea, so I wrote a patch and attached it to: http://bugs.python.org/issue13247 The patch only changes os.getcwdb() and os.listdir().
We might use PEP 383 to store undecodable bytes as surrogates (U+DC80-U+DCFF). But the situation is the opposite of the situation on UNIX: on Windows, the problem is more on encoding (text->bytes) than on decoding (bytes->text). On UNIX, problems occur when the system is misconfigured (e.g. wrong locale encoding). On Windows, problems occur when your application uses the old (ANSI) API, whereas your filesystem is fully Unicode compliant and you created Unicode filenames with a program using the new (Windows) API.
I only changed functions returning filenames, so os.mkdir() is unchanged for example. We may also patch the other functions to simplify the source code. Victor
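For readers who do not want to read the C patch, the idea for functions returning filenames can be approximated in pure Python. This is a sketch only, not the patch attached to the issue: both helper names are hypothetical, and the encoding is passed explicitly (on Windows it would presumably be "mbcs").

```python
import os


def encode_names_strict(names, encoding):
    """Encode Unicode names; raise UnicodeEncodeError on the first bad one."""
    return [name.encode(encoding, "strict") for name in names]


def listdir_bytes_strict(path, encoding):
    """Hypothetical bytes listdir: call the Unicode API, then encode strictly."""
    return encode_names_strict(os.listdir(os.fsdecode(path)), encoding)


# Encodable names pass through unchanged.
print(encode_names_strict(["a.txt", "readme"], "ascii"))

# An unencodable name is reported instead of being turned into b"h?ho.txt".
try:
    encode_names_strict(["hého.txt"], "ascii")
except UnicodeEncodeError as exc:
    print("cannot encode:", exc.object)
```

Functions that only take a filename, like os.mkdir(), are not covered by this sketch, which matches the scope described above.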
participants (7)
- "Martin v. Löwis"
- Antoine Pitrou
- Mark Hammond
- Nick Coghlan
- Stephen J. Turnbull
- Terry Reedy
- Victor Stinner