UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 10442: character maps to <undefined>
Steven D'Aprano
steve+comp.lang.python at pearwood.info
Wed May 23 02:03:38 EDT 2018
On Wed, 23 May 2018 00:31:03 +0200, Peter J. Holzer wrote:
> On 2018-05-23 07:38:27 +1000, Chris Angelico wrote:
[...]
>> You can find an encoding which is capable of decoding a file. That's
>> not the same thing.
>
> If the result is correct, it is the same thing.
But how do you know what is correct and what isn't? In the most general
case, even if you know the language nominally being used, you might not
be able to recognise good output from bad:

    Max Steele strained his mighty thews against his bonds, but
    the §-rays had left him as weak as a kitten. The evil Galactic
    Emperor, Giµx-Õƒin The Terrible of the planet Œe∂¥, laughed: "I
    have you now, Steele, and by this time tomorrow my armies will
    have overrun your pitiful Earth defences!"

If this text is encoded using MacRoman and then decoded as Latin-1, the
decode succeeds without error, and the result looks barely any more
stupid than the original:

    Max Steele strained his mighty thews against his bonds, but
    the ¤-rays had left him as weak as a kitten. The evil Galactic
    Emperor, Giµx-ÍÄin The Terrible of the planet Îe¶´, laughed: "I
    have you now, Steele, and by this time tomorrow my armies will
    have overrun your pitiful Earth defences!"

but it clearly isn't the original text.
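
For anyone who wants to reproduce the effect, here is a minimal sketch using Python's built-in "mac_roman" and "latin-1" codecs. The shortened sample string and the variable names are mine, chosen only for illustration. Note that the decode step never raises, because Latin-1 assigns a character to every possible byte:

```python
# Only the parts of the story that contain non-ASCII characters.
original = "the §-rays; Giµx-Õƒin The Terrible of the planet Œe∂¥"

# Encode with one legacy 8-bit codec, then decode with a different one.
raw = original.encode("mac_roman")
garbled = raw.decode("latin-1")  # no exception: Latin-1 maps all 256 bytes

print(garbled)
# the ¤-rays; Giµx-ÍÄin The Terrible of the planet Îe¶´

# The damage can be undone, but only if you already know both encodings.
recovered = garbled.encode("latin-1").decode("mac_roman")
print(recovered == original)
# True
```

A successful decode therefore tells you nothing by itself about whether the result is the text the author wrote.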
Mojibake is especially hard to deal with in short text snippets like
file names or user names. These can contain arbitrary characters, so
there is rarely any way to recognise the "correct" string. If you think
Giµx-Õƒin The Terrible is a ludicrous example of text, you ought to look
at user names on web forums.
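
As a rough illustration of the short-snippet problem, the sketch below decodes a few made-up bytes with several common 8-bit codecs. The byte string and the list of candidate codecs are assumptions chosen for the example, not taken from any real file:

```python
# Hypothetical raw bytes for a short user name of unknown origin.
raw_name = b"caf\xe9"

# A few plausible legacy encodings; the list is arbitrary.
candidates = ["latin-1", "cp1252", "mac_roman", "cp437"]

decoded = {}
for codec in candidates:
    try:
        decoded[codec] = raw_name.decode(codec)
    except UnicodeDecodeError:
        # Record codecs that cannot decode these bytes at all.
        decoded[codec] = None

for codec, text in decoded.items():
    print(f"{codec:10} {text!r}")
```

Every candidate here decodes without complaint, yet they do not all agree on the result. With a user name, unlike a dictionary word, nothing in the output tells you which one the user actually typed.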
--
Steve