
The question is rather: how often does web-XXX mojibake happen?

Very often. Particularly web-1252 mixed up with UTF-8. My ftfy library is tested on data from Twitter and the Common Crawl, both prime sources of mojibake.

One common mojibake sequence arises when a right curly quote is encoded as UTF-8 and then decoded as codepage 1252. In Python's official windows-1252, the result would at best be "â€�", using the 'replace' error handler. In web-1252, it would be "â€\x9d". The web-1252 version is the more common one. Of course, since Python itself is widespread, there is some survivorship bias here. Another thing you could get instead of "â€�" is your code crashing.

On Thu, 11 Jan 2018 at 12:20 Random832 <random832@fastmail.com> wrote:
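The difference can be sketched in a few lines. The error handler name "web1252" below is made up for illustration; it emulates the browser-style decoder by mapping the bytes that Python's windows-1252 treats as undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D) straight to the C1 control characters U+0081 etc.:

```python
import codecs

def web1252_fallback(exc):
    # Map the undecodable byte directly to U+0080..U+009F,
    # as a web-1252-style decoder would.
    if isinstance(exc, UnicodeDecodeError):
        byte = exc.object[exc.start]
        return (chr(byte), exc.start + 1)
    raise exc

codecs.register_error("web1252", web1252_fallback)

# Right curly quote (U+201D) encoded as UTF-8: b'\xe2\x80\x9d'
mojibake = "\u201d".encode("utf-8")

# Python's official windows-1252: 0x9D is undefined, so 'replace'
# yields "â€" followed by U+FFFD, displayed as "â€�".
print(mojibake.decode("cp1252", errors="replace"))

# Web-1252 behaviour: 0x9D becomes the control character U+009D,
# giving "â€" followed by '\x9d'.
print(mojibake.decode("cp1252", errors="web1252"))
```

A strict decode, `mojibake.decode("cp1252")`, raises UnicodeDecodeError on the 0x9D byte, which is the "code crashing" outcome mentioned above.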
On Thu, Jan 11, 2018, at 03:58, M.-A. Lemburg wrote:
> There's a problem with these encodings: they are mostly meant for decoding (broken) data, but as soon as we have them in the stdlib, people will also start using them for encoding data, producing more corrupted data.
Is it really corrupted?
> Do you really think it's a good idea to support this natively in Python?
The problem is, that's ignoring the very real fact that this is, and has always been,* the behavior of the native encodings built into Windows. My opinion is that Microsoft, for whatever reason, misrepresented their encodings when they submitted them to Unicode. The native APIs for text conversion have mechanisms for error reporting, and these supposedly undefined characters do not trigger them as they do for, e.g., CP932 0xA0.
Without the MB_ERR_INVALID_CHARS flag, cp932 0xA0 maps to U+F8F0 (private use), a best-fit mapping, and cp1252 0x81 maps to U+0081 (one of the mappings being discussed here). If you do set the MB_ERR_INVALID_CHARS flag, however, cp932 0xA0 returns error 1113** (ERROR_NO_UNICODE_TRANSLATION), whereas cp1252 0x81 still returns U+0081.
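For comparison, here is what CPython's own codecs do with the same bytes (a quick interactive check, not the Windows API):

```python
# cp1252 0x81 follows the published Unicode mapping, so a strict
# decode raises -- unlike MultiByteToWideChar, which returns U+0081
# with or without MB_ERR_INVALID_CHARS.
try:
    b"\x81".decode("cp1252")
    raised = False
except UnicodeDecodeError:
    raised = True
print("cp1252 0x81 strict raises:", raised)

# cp932 0xA0, by contrast, decodes to the private-use best fit
# U+F8F0 even under errors='strict' (see the footnote below).
print(b"\xa0".decode("cp932") == "\uf8f0")
```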
As far as the actual encoding implemented in Windows is concerned, CP1252's 0x81->U+0081 mapping is a wholly valid one (though undocumented), and not in any way a fallback, a "best fit", or an invalid character.
* Except for the addition of the Euro sign to each encoding, typically at 0x80, circa 1998.

** It's worth mentioning that our cp932 returns U+F8F0 even with errors='strict', despite this not being present in the published Unicode mapping. It has done this at least since the CJKCodecs change in 2004; I can't determine where (or if) it was implemented at all before that.

_______________________________________________
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/