[New-bugs-announce] [issue46668] encodings: the "mbcs" alias doesn't work
STINNER Victor
report at bugs.python.org
Sun Feb 6 18:06:49 EST 2022
New submission from STINNER Victor <vstinner at python.org>:
While working on bpo-46659, I found a bug in the encodings "mbcs" alias. Even though the function has 2 tests (in test_codecs and test_site), both tests missed the bug :-(
I fixed the alias with this change:
---
commit 04dd60e50cd3da48fd19cdab4c0e4cc600d6af30
Author: Victor Stinner <vstinner at python.org>
Date: Sun Feb 6 21:50:09 2022 +0100
bpo-46659: Update the test on the mbcs codec alias (GH-31168)
encodings registers the _alias_mbcs() codec search function before
the search_function() codec search function. Previously, the
_alias_mbcs() was never used.
Fix the test_codecs.test_mbcs_alias() test: use the current ANSI code
page, not a fake ANSI code page number.
Remove the test_site.test_aliasing_mbcs() test: the alias is now
implemented in the encodings module, no longer in the site module.
---
But Eryk found two bugs:
"""
This was never true before. With 1252 as my ANSI code page, I checked codecs.lookup('cp1252') in 2.7, 3.4, 3.5, 3.6, 3.9, and 3.10, and none of them return the "mbcs" encoding. It's not equivalent, and not supposed to be. The implementation of "cp1252" should be cross-platform, regardless of whether we're on a Windows system with 1252 as the ANSI code page, as opposed to a Windows system with some other ANSI code page, or a Linux or macOS system.
The differences are that "mbcs" maps every byte, whereas our code-page encodings do not map undefined bytes, and the "replace" handler of "mbcs" uses a best-fit mapping (e.g. "α" -> "a") when encoding text, instead of mapping all undefined characters to "?".
"""
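To make the difference Eryk describes concrete, here is a minimal sketch. The cross-platform part relies only on the "cp1252" codec shipped with Python; the "mbcs" part runs only on Windows, and its output depends on the ANSI code page of the machine, so no expected values are given for it. The variable names are illustrative.

```python
import sys

# 0x81 is one of the bytes that the cross-platform cp1252 codec leaves undefined,
# so strict decoding raises instead of mapping the byte to something.
try:
    b"\x81".decode("cp1252")
    cp1252_decodes_0x81 = True
except UnicodeDecodeError:
    cp1252_decodes_0x81 = False

# With errors="replace", cp1252 maps a character it cannot encode to "?".
cp1252_alpha = "α".encode("cp1252", "replace")

print(cp1252_decodes_0x81, cp1252_alpha)

if sys.platform == "win32":
    # "mbcs" goes through the Windows ANSI code page; results vary per system.
    print(ascii(b"\x81".decode("mbcs", "replace")))
    print("α".encode("mbcs", "replace"))
```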
The second bug: my new test fails if the PYTHONUTF8=1 environment variable is set:
"""
This will fail if PYTHONUTF8 is set in the environment, because it overrides getpreferredencoding(False) and _get_locale_encoding().
"""
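A rough way to observe this override, assuming Python 3.7 or newer: start a child interpreter with PYTHONUTF8=1 and ask it for locale.getpreferredencoding(False). The helper name preferred_encoding_with_pythonutf8() is made up for this sketch. On Windows, the ANSI code page reported by _winapi.GetACP() can be printed next to it for comparison.

```python
import os
import subprocess
import sys


def preferred_encoding_with_pythonutf8():
    """Return locale.getpreferredencoding(False) from a child run with PYTHONUTF8=1."""
    env = dict(os.environ, PYTHONUTF8="1")
    proc = subprocess.run(
        [sys.executable, "-c",
         "import locale; print(locale.getpreferredencoding(False))"],
        env=env,
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return proc.stdout.strip()


print("with PYTHONUTF8=1:", preferred_encoding_with_pythonutf8())

if sys.platform == "win32":
    import _winapi
    # The ANSI code page is not affected by the UTF-8 Mode.
    print("ANSI code page: cp%s" % _winapi.GetACP())
```

The spelling of the result differs between Python versions ("UTF-8" or "utf-8"), so compare it case-insensitively.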
The code for the "mbcs" alias changed a lot between Python 3.5 and 3.7.
In Python 3.5, site module:
---
def aliasmbcs():
    """On Windows, some default encodings are not provided by Python,
    while they are always available as "mbcs" in each locale. Make
    them usable by aliasing to "mbcs" in such a case."""
    if sys.platform == 'win32':
        import _bootlocale, codecs
        enc = _bootlocale.getpreferredencoding(False)
        if enc.startswith('cp'):            # "cp***" ?
            try:
                codecs.lookup(enc)
            except LookupError:
                import encodings
                encodings._cache[enc] = encodings._unknown
                encodings.aliases.aliases[enc] = 'mbcs'
---
In Python 3.6, encodings module:
---
(...)
codecs.register(search_function)

if sys.platform == 'win32':
    def _alias_mbcs(encoding):
        try:
            import _bootlocale
            if encoding == _bootlocale.getpreferredencoding(False):
                import encodings.mbcs
                return encodings.mbcs.getregentry()
        except ImportError:
            # Imports may fail while we are shutting down
            pass

    codecs.register(_alias_mbcs)
---
Python 3.7, encodings module:
---
(...)
codecs.register(search_function)

if sys.platform == 'win32':
    def _alias_mbcs(encoding):
        try:
            import _winapi
            ansi_code_page = "cp%s" % _winapi.GetACP()
            if encoding == ansi_code_page:
                import encodings.mbcs
                return encodings.mbcs.getregentry()
        except ImportError:
            # Imports may fail while we are shutting down
            pass

    codecs.register(_alias_mbcs)
---
In Python 3.6 and 3.7, "codecs.register(_alias_mbcs)" has no effect: "search_function()" is registered first, so it is tried first, and it already succeeds for "cpXXX" encodings. My change swaps the order in which the codec search functions are registered: first the MBCS alias, then the encodings search_function().
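The ordering effect can be reproduced on any platform with two toy search functions. This is only a sketch: alias_first(), generic_second() and the codec name "demoorder" are hypothetical stand-ins, not part of the encodings module. Search functions are tried in registration order, and the first one that returns a CodecInfo wins.

```python
import codecs

# Resolve two real codecs up front so the toy search functions stay trivial.
UTF8 = codecs.lookup("utf-8")
LATIN1 = codecs.lookup("latin-1")

calls = []


def alias_first(name):
    # Stand-in for a specific alias function: claims exactly one name.
    calls.append(("alias_first", name))
    return UTF8 if name == "demoorder" else None


def generic_second(name):
    # Stand-in for a generic search function that would also accept the name.
    calls.append(("generic_second", name))
    return LATIN1 if name == "demoorder" else None


# Registration order decides which function gets to answer.
codecs.register(alias_first)
codecs.register(generic_second)

info = codecs.lookup("demoorder")
print(info.name)
print(calls)
```

Because alias_first() is registered before generic_second(), the lookup never reaches the second function for "demoorder". Registering them in the opposite order would shadow alias_first() in the same way that search_function() shadowed _alias_mbcs().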
In Python 3.5, the alias was only created if Python didn't support the code page.
----------
components: Library (Lib)
messages: 412678
nosy: vstinner
priority: normal
severity: normal
status: open
title: encodings: the "mbcs" alias doesn't work
versions: Python 3.11
_______________________________________
Python tracker <report at bugs.python.org>
<https://bugs.python.org/issue46668>
_______________________________________