[issue12281] bytes.decode('mbcs', 'ignore') does replace undecodable bytes on Windows Vista or later
report at bugs.python.org
Wed Jun 8 14:47:44 CEST 2011
STINNER Victor <victor.stinner at haypocalc.com> added the comment:
mbcs.patch fixes PyUnicode_DecodeMBCS():
- only use flags=0 if errors="replace" on Windows >= Vista or if errors="ignore" on Windows < Vista
- support any error handler
- support any code page (but the code page is hardcoded to CP_ACP)
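The flag rule in the first item can be written out as a small sketch. This is Python, not the C code from the patch; choose_decode_flags() and its windows_major parameter are hypothetical names used only for illustration (Vista is Windows version 6.0). MB_ERR_INVALID_CHARS is the real MultiByteToWideChar() flag, with value 0x8.

```python
# Value of the Win32 MB_ERR_INVALID_CHARS flag for MultiByteToWideChar().
MB_ERR_INVALID_CHARS = 0x00000008


def choose_decode_flags(errors, windows_major):
    """Return the flags to pass to MultiByteToWideChar() (illustrative only).

    flags=0 is used only when the native behaviour is assumed to match the
    requested error handler: errors="replace" on Vista or later (major
    version >= 6), or errors="ignore" before Vista. Every other combination
    asks Windows to report invalid bytes so that the error handler can be
    applied by the caller.
    """
    vista_or_later = windows_major >= 6
    if errors == "replace" and vista_or_later:
        return 0
    if errors == "ignore" and not vista_or_later:
        return 0
    return MB_ERR_INVALID_CHARS
```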
My patch always tries to decode in strict mode first. On a decode error, it falls back to decoding byte by byte and calls unicode_decode_call_errorhandler() for each byte that fails.
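A rough Python model of that strategy, for illustration only: the patch itself is C and goes through MultiByteToWideChar(), whereas this sketch uses Python's own cp1252 codec as a stand-in for a single-byte ANSI code page. decode_with_fallback() is a hypothetical name, and the sketch assumes a built-in error handler that always moves forward.

```python
import codecs


def decode_with_fallback(data, encoding="cp1252", errors="strict"):
    """Decode strictly; on failure, redo the work one byte at a time."""
    try:
        # Fast path: the whole input is valid.
        return data.decode(encoding)
    except UnicodeDecodeError:
        pass

    handler = codecs.lookup_error(errors)
    out = []
    pos = 0
    while pos < len(data):
        try:
            out.append(data[pos:pos + 1].decode(encoding))
            pos += 1
        except UnicodeDecodeError:
            # Report the failing byte relative to the full input and let
            # the error handler decide the replacement and resume position.
            exc = UnicodeDecodeError(encoding, data, pos, pos + 1,
                                     "undecodable byte")
            replacement, pos = handler(exc)
            out.append(replacement)
    return "".join(out)
```

With errors="strict" the handler simply re-raises, so the fallback loop only changes the result for the other handlers.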
TODO:
- don't use insize=1 (decode byte by byte): it doesn't work with multibyte encodings (like UTF-8)
- use final in decode_mbcs_errors(): a multibyte character may be split between two chunks of INT_MAX bytes
- fix all FIXME
- also patch PyUnicode_EncodeMBCS()
- implement the optimizations suggested by Martin?
- MB_ERR_INVALID_CHARS is not supported by some code pages (e.g. UTF-7 code page)
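The two multibyte items in this list can be reproduced with Python's UTF-8 codec. This is only an analogy to the C code: decode_bytewise() and decode_in_chunks() are hypothetical helpers, and the tiny chunk_size stands in for the INT_MAX chunks. The incremental decoder's final argument plays the role that final would play in decode_mbcs_errors().

```python
import codecs


def decode_bytewise(data, encoding):
    """Naive insize=1 approach: each byte is decoded on its own."""
    return "".join(data[i:i + 1].decode(encoding) for i in range(len(data)))


def decode_in_chunks(data, encoding, chunk_size):
    """Decode chunk by chunk, passing final=True only for the last chunk.

    With final=False the decoder keeps an incomplete trailing sequence
    buffered instead of reporting it as an error.
    """
    decoder = codecs.getincrementaldecoder(encoding)("strict")
    parts = []
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        final = start + chunk_size >= len(data)
        parts.append(decoder.decode(chunk, final=final))
    return "".join(parts)


sample = "h\xe9llo".encode("utf-8")  # the two-byte character straddles chunks

try:
    decode_bytewise(sample, "utf-8")
    bytewise_ok = True
except UnicodeDecodeError:
    # A lone lead or continuation byte is not valid UTF-8 by itself.
    bytewise_ok = False

chunked = decode_in_chunks(sample, "utf-8", chunk_size=2)
```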
Is it necessary to write a NUL character at the end? ("*out = 0;")
It would be nice to support any code page, and maybe support more options (e.g. MB_COMPOSITE, MB_PRECOMPOSED, MB_USEGLYPHCHARS to decode).
It is possible to test different code pages by changing the hardcoded code_page value in PyUnicode_DecodeMBCS. Change your region in the control panel if you would like to change the Windows ANSI code page. You can also play with SetThreadLocale() and CP_THREAD_ACP to test the ANSI code page of the current thread.
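To get a feel for how much the result depends on the code page without touching the Windows settings, the same bytes can be run through Python's own code page codecs. This is a portable sketch and does not call the Windows API, so its results may differ from MultiByteToWideChar() in corner cases; try_code_pages() is a hypothetical helper.

```python
def try_code_pages(data, code_pages=("cp1252", "cp1251", "cp932")):
    """Strictly decode data with each codec; None marks a decode error."""
    results = {}
    for name in code_pages:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


# 0x81 is undefined in cp1252 but is a valid lead byte in cp932.
results = try_code_pages(b"\x81\x40")
```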
Added file: http://bugs.python.org/file22282/mbcs.patch
Python tracker <report at bugs.python.org>