
On 2018-01-11 19:42, Rob Speer wrote:
The question is rather: how often does web-XXX mojibake happen?
Very often. Particularly web-1252 mixed up with UTF-8.
My ftfy library is tested on data from Twitter and the Common Crawl, both prime sources of mojibake. One common mojibake sequence is when a right curly quote is encoded as UTF-8 and decoded as codepage 1252. In Python's official windows-1252, this would at best be "â€�", using the 'replace' error handler. In web-1252, this would be "â€\x9d". The web-1252 version is more common.
Of course, since Python itself is widespread, there is some survivorship bias here. Another thing you could get instead of "�" is your code crashing.
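
To make the difference concrete, here is a minimal Python sketch of the three outcomes described above: the crash, the 'replace' result, and the web-1252 result. It assumes the right curly quote in question is U+201D. The helper decode_web_1252 is a hypothetical name used only for illustration, not part of ftfy or the standard library; it approximates web-1252 by falling back to Latin-1 for the bytes that Python's windows-1252 codec leaves unmapped.

```python
# U+201D RIGHT DOUBLE QUOTATION MARK, encoded as UTF-8: b'\xe2\x80\x9d'
raw = "\u201d".encode("utf-8")

# Python's windows-1252 codec has no mapping for byte 0x9D,
# so a strict decode raises instead of returning text.
strict_failed = False
try:
    raw.decode("windows-1252")
except UnicodeDecodeError:
    strict_failed = True

# With the 'replace' error handler, the unmapped byte becomes U+FFFD.
replaced = raw.decode("windows-1252", errors="replace")


def decode_web_1252(data: bytes) -> str:
    """Hypothetical helper: decode as windows-1252, but map the bytes
    Python leaves undefined to the C1 control with the same number,
    which is what a Latin-1 decode of that byte produces."""
    out = []
    for value in data:
        byte = bytes([value])
        try:
            out.append(byte.decode("windows-1252"))
        except UnicodeDecodeError:
            out.append(byte.decode("latin-1"))
    return "".join(out)


web = decode_web_1252(raw)

print(strict_failed)    # True
print(ascii(replaced))  # '\xe2\u20ac\ufffd'
print(ascii(web))       # '\xe2\u20ac\x9d'
```

The 'replace' result has lost the original byte, so the mojibake can no longer be reversed, whereas the web-1252 result still carries 0x9D as U+009D.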
FWIW, I've occasionally seen that kind of mojibake on the news ticker of the BBC News channel. :-( [snip]