[Python-ideas] Support WHATWG versions of legacy encodings

Thu Jan 11 12:19:26 EST 2018

On Thu, Jan 11, 2018, at 03:58, M.-A. Lemburg wrote:
> There's a problem with these encodings: they are mostly meant
> for decoding (broken) data, but as soon as we have them in the stdlib,
> people will also start using them for encoding data, producing more
> corrupted data.

Is it really corrupted?

> Do you really think it's a good idea to support this natively
> in Python ?

The problem is, that's ignoring the very real fact that this is, and has always been,* the behavior of the native encodings built into Windows. My opinion is that Microsoft, for whatever reason, misrepresented their encodings when they submitted them to Unicode. The native APIs for text conversion have mechanisms for error reporting, and these supposedly undefined characters do not trigger them the way that, e.g., CP932 0xA0 does.

Without the MB_ERR_INVALID_CHARS flag, cp932 0xA0 maps to U+F8F0 (private use), a best fit mapping, and cp1252 0x81 maps to U+0081 (one of the mappings being discussed here).
If you do set the MB_ERR_INVALID_CHARS flag, however, cp932 0xA0 returns error 1113** (ERROR_NO_UNICODE_TRANSLATION), whereas cp1252 0x81 still returns U+0081.

As far as the actual encoding implemented in Windows is concerned, CP1252's 0x81->U+0081 mapping is a wholly valid one (though undocumented), and not in any way a fallback, a "best fit", or an invalid character.

*except for the addition of the Euro sign to each encoding, typically at 0x80, circa 1998.
**It's worth mentioning that our cp932 returns U+F8F0, even with errors='strict', despite this not being present in the mapping published by Unicode. It has done this at least since the CJKCodecs change in 2004. I can't determine where (or if) it was implemented at all before that.
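
For anyone who wants to check the Python side of this, here is a minimal sketch that decodes the two bytes discussed above with errors='strict'. The helper name decode_strict is made up for illustration and is not part of the stdlib. It only exercises Python's own codecs, not the Windows conversion APIs, and the result may vary between Python versions.

```python
def decode_strict(data: bytes, encoding: str):
    """Decode with errors='strict'; return None if the codec rejects the input.

    Hypothetical helper for illustration only.
    """
    try:
        return data.decode(encoding, errors="strict")
    except UnicodeDecodeError:
        return None


# The two bytes from the discussion, plus latin-1 as a reference point:
# latin-1 maps every byte 0xNN straight to U+00NN.
results = {
    "cp1252 0x81": decode_strict(b"\x81", "cp1252"),
    "cp932 0xA0": decode_strict(b"\xa0", "cp932"),
    "latin-1 0x81": decode_strict(b"\x81", "latin-1"),
}

for label, value in results.items():
    shown = "error" if value is None else f"U+{ord(value):04X}"
    print(f"{label} -> {shown}")
```

If the stdlib codecs behave as described in this thread, the cp1252 line should report an error while the cp932 line reports U+F8F0, which is the reverse of what Windows does when MB_ERR_INVALID_CHARS is set.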