[Python-ideas] Support WHATWG versions of legacy encodings
python at mrabarnett.plus.com
Thu Jan 11 16:09:22 EST 2018
On 2018-01-11 19:42, Rob Speer wrote:
> > The question is rather: how often does web-XXX mojibake happen?
> Very often. Particularly web-1252 mixed up with UTF-8.
> My ftfy library is tested on data from Twitter and the Common Crawl,
> both prime sources of mojibake. One common mojibake sequence is when a
> right curly quote is encoded as UTF-8 and decoded as codepage 1252. In
> Python's official windows-1252, this would at best be "â€�", using the
> 'replace' error handler. In web-1252, this would be "â€\x9d". The
> web-1252 version is more common.
> Of course, since Python itself is widespread, there is some survivorship
> bias here. Another thing you could get instead of "â€�" is your code
FWIW, I've occasionally seen that kind of mojibake on the news ticker of
the BBC News channel. :-(
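
To make the quoted example concrete, here is a minimal sketch of the right-curly-quote case, using only the standard library. The error handler name "c1_passthrough" is made up for illustration and is not an existing codec or part of ftfy. It only approximates the web-1252 behaviour by passing each byte that cp1252 leaves undefined through as the code point with the same value, so treat it as an assumption about the proposal rather than an implementation of it.

```python
import codecs


def c1_passthrough(exc):
    """Hypothetical error handler: map an undecodable byte to the code
    point with the same value (e.g. 0x9D -> U+009D), which is what the
    WHATWG variant of windows-1252 does for its unassigned bytes."""
    if isinstance(exc, UnicodeDecodeError):
        bad = exc.object[exc.start:exc.end]
        return bad.decode("latin-1"), exc.end
    raise exc


codecs.register_error("c1_passthrough", c1_passthrough)

# A right curly quote (U+201D) encoded as UTF-8 is the bytes E2 80 9D.
raw = "\u201d".encode("utf-8")

# Python's cp1252 has no mapping for byte 0x9D, so strict decoding fails.
strict_failed = False
try:
    raw.decode("cp1252")
except UnicodeDecodeError:
    strict_failed = True

# With 'replace', the undefined byte becomes U+FFFD.
replaced = raw.decode("cp1252", errors="replace")

# With the pass-through handler, the byte survives as U+009D.
web_like = raw.decode("cp1252", errors="c1_passthrough")

print(strict_failed, ascii(replaced), ascii(web_like))
```

The practical difference is that the pass-through result can be re-encoded to the original bytes and decoded again as UTF-8, whereas the U+FFFD from 'replace' has already lost the byte.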