[Python-Dev] Use our strict mbcs codec instead of the Windows ANSI API

Nick Coghlan ncoghlan at gmail.com
Tue Oct 25 01:55:48 CEST 2011


On Tue, Oct 25, 2011 at 8:57 AM, Victor Stinner
<victor.stinner at haypocalc.com> wrote:
> The ANSI API uses the MultiByteToWideChar (decode) and WideCharToMultiByte
> (encode) functions in the default mode (flags=0): MultiByteToWideChar()
> replaces undecodable bytes with '?' and WideCharToMultiByte() ignores
> unencodable characters (!!!). This behaviour produces invalid filenames (see
> for example issue #13247) and *the user is unable to detect codec errors*.
>
> In Python 3.2, I changed the MBCS codec to make it strict: it raises a
> UnicodeEncodeError if a character cannot be encoded to the ANSI code page
> (e.g. encode Ł to cp1252) and a UnicodeDecodeError if a byte sequence cannot
> be decoded from the ANSI code page (e.g. b'\xff' from cp932).
>
> I propose reusing our MBCS codec in strict mode (error handler="strict")
> together with the Windows native (wide character) API, so that encode/decode
> errors are noticed immediately. It should simplify the source code: 2 versions
> of a function are replaced by 1 version plus optional code to decode the
> arguments and/or encode the result.

So we'd be taking existing failures that appear at whatever point the
corrupted filename is used and replacing them with explicit failures
at the point where the offending string is converted to or from
encoded bytes? That sounds reasonable to me, and a lot closer to the
way Python behaves on POSIX-based systems.
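
To make the difference concrete, here is a minimal sketch of lossy versus
strict conversion. It assumes Python's portable "cp1252" codec as a stand-in
for the Windows ANSI code page (the "mbcs" codec itself only exists on
Windows), and the helper names encode_filename and decode_filename are
hypothetical, chosen just for this illustration. The decode case uses
b"\x81", a byte that Python's cp1252 codec leaves undefined.

```python
def encode_filename(name, errors="strict"):
    """Encode a filename to cp1252 (assumed stand-in for the ANSI code page)."""
    return name.encode("cp1252", errors)


def decode_filename(raw, errors="strict"):
    """Decode cp1252 bytes back to str with the given error handler."""
    return raw.decode("cp1252", errors)


name = "\u0141ukasz.txt"  # begins with U+0141 (Ł), which cp1252 cannot encode

# Lossy handlers silently yield a different filename; the caller sees no error.
print(encode_filename(name, "replace"))
print(encode_filename(name, "ignore"))

# Strict mode fails at the conversion point instead of when the name is used.
try:
    encode_filename(name)
except UnicodeEncodeError as exc:
    print("encode failed:", exc.reason)

try:
    decode_filename(b"\x81")
except UnicodeDecodeError as exc:
    print("decode failed:", exc.reason)
```

With "replace" or "ignore" the mangled name only causes trouble later, when
it is used to open or create a file; with "strict" the exception points
directly at the conversion that lost information.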

Regards,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia

