
The question is rather: how often does web-XXX mojibake happen?

Very often. Particularly web-1252 mixed up with UTF-8. My ftfy library is tested on data from Twitter and the Common Crawl, both prime sources of mojibake.

One common mojibake sequence arises when a right curly quote is encoded as UTF-8 and then decoded as codepage 1252. In Python's official windows-1252, the result would at best be "â€�", using the 'replace' error handler. In web-1252, it would be "â€\x9d". The web-1252 version is the more common one. Of course, since Python itself is widespread, there is some survivorship bias here. Another thing you could get instead of "â€�" is your code crashing.

On Thu, 11 Jan 2018 at 12:20 Random832 <random832@fastmail.com> wrote:
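The difference can be sketched in a few lines. The error handler name "web1252" below is made up for illustration; it emulates the browser-style decoder by mapping the bytes that Python's windows-1252 treats as undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D) straight to the C1 control characters U+0081 etc.:

```python
import codecs

def web1252_fallback(exc):
    # Map the undecodable byte directly to U+0080..U+009F,
    # as a web-1252-style decoder would.
    if isinstance(exc, UnicodeDecodeError):
        byte = exc.object[exc.start]
        return (chr(byte), exc.start + 1)
    raise exc

codecs.register_error("web1252", web1252_fallback)

# Right curly quote (U+201D) encoded as UTF-8: b'\xe2\x80\x9d'
mojibake = "\u201d".encode("utf-8")

# Python's official windows-1252: 0x9D is undefined, so 'replace'
# yields "â€" followed by U+FFFD, displayed as "â€�".
print(mojibake.decode("cp1252", errors="replace"))

# Web-1252 behaviour: 0x9D becomes the control character U+009D,
# giving "â€" followed by '\x9d'.
print(mojibake.decode("cp1252", errors="web1252"))
```

A strict decode, `mojibake.decode("cp1252")`, raises UnicodeDecodeError on the 0x9D byte, which is the "code crashing" outcome mentioned above.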
On Thu, Jan 11, 2018, at 03:58, M.-A. Lemburg wrote:
> There's a problem with these encodings: they are mostly meant for decoding (broken) data, but as soon as we have them in the stdlib, people will also start using them for encoding data, producing more corrupted data.
Is it really corrupted?
> Do you really think it's a good idea to support this natively in Python?
The problem is, that's ignoring the very real fact that this is, and has always been,* the behavior of the native encodings built into Windows. My opinion is that Microsoft, for whatever reason, misrepresented their encodings when they submitted them to Unicode. The native APIs for text conversion have mechanisms for error reporting, and these supposedly undefined characters do not trigger them as they do for, e.g., CP932 0xA0.
Without the MB_ERR_INVALID_CHARS flag, cp932 0xA0 maps to U+F8F0 (private use), a best-fit mapping, and cp1252 0x81 maps to U+0081 (one of the mappings being discussed here). If you do set the MB_ERR_INVALID_CHARS flag, however, cp932 0xA0 returns error 1113** (ERROR_NO_UNICODE_TRANSLATION), whereas cp1252 0x81 still returns U+0081.
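For comparison, here is what CPython's own codecs do with the same bytes (a quick interactive check, not the Windows API):

```python
# cp1252 0x81 follows the published Unicode mapping, so a strict
# decode raises -- unlike MultiByteToWideChar, which returns U+0081
# with or without MB_ERR_INVALID_CHARS.
try:
    b"\x81".decode("cp1252")
    raised = False
except UnicodeDecodeError:
    raised = True
print("cp1252 0x81 strict raises:", raised)

# cp932 0xA0, by contrast, decodes to the private-use best fit
# U+F8F0 even under errors='strict' (see the footnote below).
print(b"\xa0".decode("cp932") == "\uf8f0")
```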
As far as the actual encoding implemented in Windows is concerned, CP1252's 0x81->U+0081 mapping is a wholly valid one (though undocumented), and not in any way a fallback, a "best fit", or an invalid character.
* Except for the addition of the Euro sign to each encoding, typically at 0x80, circa 1998.

** It's worth mentioning that our cp932 returns U+F8F0 even with errors='strict', despite this not being present in the published Unicode mapping. It has done this at least since the CJKCodecs change in 2004; I can't determine where (or if) it was implemented at all before that.

_______________________________________________
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/