UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 10442: character maps to <undefined>
Steven D'Aprano
steve+comp.lang.python at pearwood.info
Tue May 29 08:44:28 EDT 2018
On Tue, 29 May 2018 10:34:50 +0200, Peter J. Holzer wrote:
> On 2018-05-23 06:03:38 +0000, Steven D'Aprano wrote:
>> On Wed, 23 May 2018 00:31:03 +0200, Peter J. Holzer wrote:
>> > On 2018-05-23 07:38:27 +1000, Chris Angelico wrote:
>> >> You can find an encoding which is capable of decoding a file. That's
>> >> not the same thing.
>> >
>> > If the result is correct, it is the same thing.
>>
>> But how do you know what is correct and what isn't?
[...]
>> If this text is encoded using MacRoman, then decoded in Latin-1, it
>> works, and looks barely any more stupid than the original:
>>
>> Max Steele strained his mighty thews against his bonds, but the
>> ¤-rays had left him as weak as a kitten. The evil Galactic Emperor,
>> Giµx-ÍÄin The Terrible of the planet Îe¶´, laughed: "I have you
>> now, Steele, and by this time tomorrow my armies will have overrun
>> your pitiful Earth defences!"
>>
>> but it clearly isn't the original text.
>
> Please note that I wrote "almost always", not "always". It is of course
> possible to construct contrived examples where it is impossible to find
> the correct encoding, because all encodings lead to equally ludicrous
> results.
Whether the results are ludicrous is not the point; the point is whether
the decoded text is the text the author originally intended.
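To make that concrete, here is a minimal sketch (the string and the codec
pair are purely illustrative) of how "it decodes without an exception" is
a much weaker claim than "it decodes to the right text":

    # A short string made of characters that MacRoman can represent.
    original = "Ω ≠ π"
    raw = original.encode("mac_roman")   # the bytes actually written to disk

    print(raw.decode("mac_roman"))       # 'Ω ≠ π' -- the intended text
    print(raw.decode("latin-1"))         # no exception, but the result is
                                         # not the original text

Latin-1 happily decodes every possible byte, so *any* file "works" with
it; a successful decode tells you nothing about whether the result is
correct.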
What you describe works for the EASY cases: you have a small number of
text files in some human-readable language, the files are all valid
texts in that language, and you have an expert in that language on hand
who can distinguish between valid and invalid decoded texts.
If that applies to your text files, great: you have nothing to fear from
encoding issues! Even if the supplier of the files wouldn't know ASCII
from EBCDIC if it fell on them from a great height, you can probably make
an educated guess at what the encoding is. Wonderful.
But that's not always the case. In the real world, especially now that we
interchange documents from all over the world, it isn't the hard cases
that are contrived. Depending on the type of document (e.g. web pages you
scrape are probably different from emails, which are different from
commercial CSV files...), it is being able to just look at the file and
deduce the correct encoding that is the contrived example.
Depending on where the text is coming from:
- you might not have an expert on hand who can distinguish between
valid and invalid text;
- you might have to process a large number of files (thousands or
millions) automatically, and cannot hand-process those that have
encoding problems (see the detection sketch further below);
- your files might not even be in a single consistent encoding, or
may have Mojibake introduced at some earlier point that you do not
have control over;
- you might not know what language the text is supposed to be in;
- or it might contain isolated words in some unknown language;
e.g. your text might be nearly all ASCII English, except for a word
"Čezare" (if using the Czech Kamenický encoding) or "Çezare" (if
using the Polish Mazovia encoding) or "Äezare" (Mac Roman).
How many languages do you need to check to determine which is
correct? (Hint: all three words are valid.)
- not all encoding problems are as easy to resolve as your earlier
German/Russian example. For example, like Japanese, Russian has a
number of popular but mutually incompatible encodings. Mojibake is a
Japanese term, but the Russians have their own word for it:
krakozyabry (кракозя́бры).
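As another minimal sketch (all the codecs named below ship with Python;
the example word is just for illustration), the same Cyrillic bytes
decode "successfully" under several Russian encodings, while only one of
them returns the intended word:

    original = "кракозябры"             # the Russian word for mojibake
    raw = original.encode("koi8_r")     # bytes as a KOI8-R file would store them

    print(raw.decode("koi8_r"))         # 'кракозябры' -- the intended text
    print(raw.decode("cp1251"))         # no error, but the wrong Cyrillic letters
    print(raw.decode("cp866"))          # no error, mostly box-drawing characters
    print(raw.decode("mac_cyrillic"))   # no error, still wrong

Every one of those calls returns a string; nothing in the decode step
itself tells you which result, if any, is the text the author actually
wrote.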
Dealing with bad data is *hard*.
https://www.safaribooksonline.com/library/view/bad-data-handbook/9781449324957/ch04.html
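For the bulk-processing case above, about the best you can do
automatically is a statistical guess plus a cut-off for human review.
A minimal sketch, assuming the third-party chardet package is installed
(the function name and the threshold are mine, purely illustrative):

    import chardet

    def guess_encoding(path, threshold=0.8):
        """Best-guess encoding for the file at *path*, or None if the guess is weak."""
        with open(path, "rb") as f:
            raw = f.read()
        result = chardet.detect(raw)    # e.g. {'encoding': 'KOI8-R', 'confidence': 0.87, ...}
        if result["encoding"] and result["confidence"] >= threshold:
            return result["encoding"]
        return None                     # too uncertain: queue the file for a human

And even a high-confidence guess is still only a guess: no detector can
tell apart the Kamenický, Mazovia and MacRoman readings in the example
above, which is exactly the point.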
--
Steven D'Aprano
"Ever since I learned about confirmation bias, I've been seeing
it everywhere." -- Jon Ronson