[Python-Dev] Use our strict mbcs codec instead of the Windows ANSI API

Tue Oct 25 00:57:42 CEST 2011

Hi,

I propose to raise Unicode errors if a filename cannot be decoded on Windows,
instead of creating bogus filenames with question marks. This change is
incompatible with Python 3.2, even though such filenames are unusable and I
consider the problem a (Python?) bug, so I would like your opinion on the
change before working on a patch.

--

Windows has worked internally with Unicode strings since Windows 95 (or
something like that), but it also provides an "ANSI" API using the ANSI code
page and byte strings for backward compatibility. It has already been proposed
to drop the bytes API completely from our nt (os) module, but that might break
backward compatibility (and it is difficult to list the Python programs that
use the bytes API to access the file system).

The ANSI API uses the MultiByteToWideChar (decode) and WideCharToMultiByte
(encode) functions in the default mode (flags=0): MultiByteToWideChar()
replaces undecodable bytes with '?' and WideCharToMultiByte() ignores
unencodable characters (!!!). This behaviour produces invalid filenames (see
for example issue #13247) and *the user is unable to detect codec errors*.
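
To illustrate the silent data loss, here is a rough sketch that emulates lossy
conversions with Python's own "replace" and "ignore" error handlers. It assumes
cp1252 as the ANSI code page and uses a made-up filename; the real Windows
functions may differ in detail, so take it only as an approximation:

```python
# Approximation only: Python's error handlers stand in for the lossy
# Windows conversions; cp1252 is an assumed ANSI code page.
ANSI = "cp1252"
name = "Łódź.txt"   # hypothetical filename; Ł and ź are not in cp1252

replaced = name.encode(ANSI, "replace")   # unencodable characters become b"?"
ignored = name.encode(ANSI, "ignore")     # unencodable characters are dropped

# Neither call raised an exception, so the caller cannot tell that the
# name was damaged: decoding the bytes again gives a different name.
print(replaced)                           # b'?\xf3d?.txt'
print(ignored)                            # b'\xf3d.txt'
print(replaced.decode(ANSI) == name)      # False
print(ignored.decode(ANSI) == name)       # False
```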

In Python 3.2, I changed the MBCS codec to make it strict: it raises a
UnicodeEncodeError if a character cannot be encoded to the ANSI code page
(e.g. encode Ł to cp1252) and a UnicodeDecodeError if a byte sequence cannot
be decoded from the ANSI code page (e.g. b'\xff' from cp932).
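
A minimal sketch of what strict behaviour looks like from Python code. It uses
Python's own cp1252 codec as a stand-in, because the "mbcs" codec only exists
on Windows. The byte 0x81 (undefined in Python's cp1252 table) is my choice for
the decoding case, since Python's built-in cp932 codec is not guaranteed to
match the Windows code page; the helper names are hypothetical:

```python
# cp1252 is used explicitly as a stand-in for the ANSI code page.
ANSI = "cp1252"

def try_encode(text):
    """Return the encoded bytes, or the UnicodeEncodeError that was raised."""
    try:
        return text.encode(ANSI, "strict")
    except UnicodeEncodeError as exc:
        return exc

def try_decode(data):
    """Return the decoded text, or the UnicodeDecodeError that was raised."""
    try:
        return data.decode(ANSI, "strict")
    except UnicodeDecodeError as exc:
        return exc

encode_result = try_encode("\u0141")   # Ł has no cp1252 representation
decode_result = try_decode(b"\x81")    # 0x81 is unassigned in Python's cp1252

print(type(encode_result).__name__)    # UnicodeEncodeError
print(type(decode_result).__name__)    # UnicodeDecodeError
```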

I propose to reuse our MBCS codec in strict mode (error handler="strict")
together with the native Windows (wide character) API, so that encode/decode
errors are noticed directly. It should also simplify the source code: the 2
versions of each function are replaced by 1 version plus optional code to
decode the arguments and/or encode the result.
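
The "1 version + optional decode/encode" idea could look roughly like the
following sketch. It is written in Python for brevity and is not the actual
patch: with_bytes_support() and fake_listdir() are hypothetical names, and
cp1252 again stands in for the "mbcs" codec:

```python
def with_bytes_support(unicode_func, encoding):
    """Wrap a str-only function so that it also accepts a bytes path.

    Bytes arguments are decoded strictly before the call and the resulting
    names are encoded strictly afterwards, so codec errors are raised
    instead of producing damaged filenames.
    """
    def wrapper(path):
        if not isinstance(path, bytes):
            return unicode_func(path)
        upath = path.decode(encoding, "strict")
        return [name.encode(encoding, "strict") for name in unicode_func(upath)]
    return wrapper

def fake_listdir(path):
    # Stand-in for the single wide character implementation (made-up data).
    if path == "bad":
        return ["a.txt", "\u0141.txt"]
    return ["a.txt"]

listdir = with_bytes_support(fake_listdir, "cp1252")

print(listdir(b"ok"))       # bytes in, bytes out
print(len(listdir("bad")))  # str in, str out: no conversion happens
try:
    listdir(b"bad")         # one name cannot be encoded to cp1252
except UnicodeEncodeError as exc:
    print("error reported:", exc.reason)
```

With this shape, the bytes variant is only a thin layer over the Unicode one,
and the place where information could be lost is explicit.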

--

See also the previous thread:

[Python-Dev] Byte filenames in the posix module on Windows
Wed Jun 8 00:23:20 CEST 2011
http://mail.python.org/pipermail/python-dev/2011-June/111831.html

--

FYI I patched Python's MBCS codec again: it now handles the ignore and
replace modes correctly (for both encoding and decoding), and it now also
supports any error handler.
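
As an illustration of what "any error handler" means in practice, here is a
small sketch using Python's generic error handler mechanism. It runs against
the cp1252 codec rather than "mbcs" (which is Windows-only), and the
"underscore" handler is a hypothetical example registered just for this demo:

```python
import codecs

def underscore_errors(exc):
    """Hypothetical handler: replace each unencodable run with one '_'."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    # Return the replacement text and the position where encoding resumes.
    return ("_", exc.end)

codecs.register_error("underscore", underscore_errors)

text = "\u0141.txt"   # Ł cannot be encoded to cp1252

escaped = text.encode("cp1252", "backslashreplace")
xmlref = text.encode("cp1252", "xmlcharrefreplace")
custom = text.encode("cp1252", "underscore")

print(escaped)   # b'\\u0141.txt'
print(xmlref)    # b'&#321;.txt'
print(custom)    # b'_.txt'
```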

--

We might use PEP 383 to store undecodable bytes as surrogates (U+DC80-U+DCFF).
But the situation is the opposite of the situation on UNIX: on Windows, the
problem is more with encoding (text->bytes) than with decoding (bytes->text).
On UNIX, problems occur when the system is misconfigured (e.g. wrong locale
encoding). On Windows, problems occur when your application uses the old
(ANSI) API, whereas your filesystem is fully Unicode compliant and you
created Unicode filenames with a program using the new (Windows) API.
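
A short sketch of why PEP 383 helps with decoding but not with the Windows
problem. The "surrogateescape" handler is the one defined by PEP 383; the
ascii and cp1252 codecs and the sample values are only illustrative choices:

```python
# Decoding direction (the UNIX problem): an undecodable byte is kept as a
# lone surrogate in U+DC80-U+DCFF and can be restored exactly on encoding.
raw = b"caf\xe9"
text = raw.decode("ascii", "surrogateescape")
restored = text.encode("ascii", "surrogateescape")
print(ascii(text))       # 'caf\udce9'
print(restored == raw)   # True

# Encoding direction (the Windows problem): a real character that is
# missing from the code page is not a smuggled byte, so the handler has
# nothing to restore and the error is still raised.
try:
    "\u0141".encode("cp1252", "surrogateescape")
    encodable = True
except UnicodeEncodeError:
    encodable = False
print(encodable)         # False
```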

Only a few programs are fully Unicode compliant. A lot of programs fail if a
filename cannot be encoded to the ANSI code page (just 2 examples: Mercurial
and Visual Studio).

Victor