[Python-ideas] Support WHATWG versions of legacy encodings

Rob Speer rspeer at luminoso.com
Thu Jan 11 14:42:45 EST 2018


> The question is rather: how often does web-XXX mojibake happen?

Very often. Particularly web-1252 mixed up with UTF-8.

My ftfy library is tested on data from Twitter and the Common Crawl, both
prime sources of mojibake. One common mojibake sequence is when a right
curly double quote (U+201D) is encoded as UTF-8 and decoded as codepage
1252. In Python's official windows-1252, this would at best be "â€�", using
the 'replace' error handler. In web-1252, this would be "â€\x9d". The
web-1252 version is more common.
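
To make the difference concrete, here is a minimal sketch of the three
outcomes for that byte sequence. Python has no built-in web-1252 codec, so
the sketch approximates it with a custom error handler that passes
undecodable bytes through as the code point of the same value. The handler
name "web1252_fallback" and the function latin1_fallback are made up for
this illustration; they are not part of ftfy or the stdlib.

```python
import codecs


def latin1_fallback(err):
    """Map each undecodable byte to the code point with the same value."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    return bad.decode("latin-1"), err.end


# Hypothetical handler name, registered only for this sketch.
codecs.register_error("web1252_fallback", latin1_fallback)

# U+201D RIGHT DOUBLE QUOTATION MARK is b'\xe2\x80\x9d' in UTF-8.
mojibake_bytes = "\u201d".encode("utf-8")

# Strict decoding fails because 0x9D is unassigned in Python's cp1252.
try:
    mojibake_bytes.decode("cp1252")
    strict_result = "decoded"
except UnicodeDecodeError:
    strict_result = "UnicodeDecodeError"

# 'replace' turns the unassigned byte into U+FFFD.
replaced = mojibake_bytes.decode("cp1252", "replace")

# The fallback handler keeps it as U+009D, as web-1252 would.
web_like = mojibake_bytes.decode("cp1252", "web1252_fallback")

print(strict_result)    # UnicodeDecodeError
print(ascii(replaced))  # '\xe2\u20ac\ufffd'
print(ascii(web_like))  # '\xe2\u20ac\x9d'
```

Only the last form keeps enough information to recover the original
character by re-encoding and decoding as UTF-8.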

Of course, since Python itself is widespread, there is some survivorship
bias here. Another thing you could get instead of "�" is your code
crashing.

On Thu, 11 Jan 2018 at 12:20 Random832 <random832 at fastmail.com> wrote:

> On Thu, Jan 11, 2018, at 03:58, M.-A. Lemburg wrote:
> > There's a problem with these encodings: they are mostly meant
> > for decoding (broken) data, but as soon as we have them in the stdlib,
> > people will also start using them for encoding data, producing more
> > corrupted data.
>
> Is it really corrupted?
>
> > Do you really think it's a good idea to support this natively
> > in Python ?
>
> The problem is, that's ignoring the very real fact that this is, and has
> always been,* the behavior of the native encodings built in to Windows. My
> opinion is that Microsoft, for whatever reason, misrepresented their
> encodings when they submitted them to Unicode. The native APIs for text
> conversion have mechanisms for error reporting, and these supposedly
> undefined characters do not trigger them as they do for e.g. CP932 0xA0.
>
> Without the MB_ERR_INVALID_CHARS flag, cp932 0xA0 maps to U+F8F0 (private
> use), a best fit mapping, and cp1252 0x81 maps to U+0081 (one of the
> mappings being discussed here).
> If you do set the MB_ERR_INVALID_CHARS flag, however, cp932 0xA0 returns
> an error 1113** (ERROR_NO_UNICODE_TRANSLATION), whereas cp1252 0x81 still
> returns U+0081.
>
> As far as the actual encoding implemented in Windows is concerned,
> CP1252's 0x81->U+0081 mapping is a wholly valid one (though undocumented),
> and not in any way a fallback or a "best fit" or an invalid character.
>
> *except for the addition of the Euro sign to each encoding, typically at
> 0x80, circa 1998.
> **It's worth mentioning that our cp932 returns U+F8F0, even with
> errors='strict', despite this not being present in the mapping published
> by Unicode. It has done this at least since the CJKCodecs change in 2004.
> I can't determine where (or if) it was implemented at all before that.
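
The asymmetry described in the quoted message can be checked from the
Python side. This is a small sketch that assumes the cp932 behavior from
the footnote still holds in the Python version being run. The helper
try_decode is a name made up for the example, and latin-1 is used only to
show the U+0081 result that the message attributes to Windows.

```python
def try_decode(data, encoding):
    """Decode with errors='strict'; return None if the bytes are rejected."""
    try:
        return data.decode(encoding, errors="strict")
    except UnicodeDecodeError:
        return None


# Python's cp1252 treats 0x81 as undefined, unlike the Windows API.
cp1252_0x81 = try_decode(b"\x81", "cp1252")

# Python's cp932 accepts 0xA0 even in strict mode (per the footnote above).
cp932_0xa0 = try_decode(b"\xa0", "cp932")

# latin-1 maps every byte to the same code point, giving U+0081 here.
latin1_0x81 = try_decode(b"\x81", "latin-1")

print(cp1252_0x81)
print(ascii(cp932_0xa0))
print(ascii(latin1_0x81))
```

So Python is stricter than Windows for cp1252 and, going by the footnote,
more lenient than the published mapping for cp932.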