UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 10442: character maps to <undefined>
Steven D'Aprano
steve+comp.lang.python at pearwood.info
Tue May 29 08:44:28 EDT 2018
On Tue, 29 May 2018 10:34:50 +0200, Peter J. Holzer wrote:
> On 2018-05-23 06:03:38 +0000, Steven D'Aprano wrote:
>> On Wed, 23 May 2018 00:31:03 +0200, Peter J. Holzer wrote:
>> > On 2018-05-23 07:38:27 +1000, Chris Angelico wrote:
>> >> You can find an encoding which is capable of decoding a file. That's
>> >> not the same thing.
>> >
>> > If the result is correct, it is the same thing.
>>
>> But how do you know what is correct and what isn't?
[...]
>> If this text is encoded using MacRoman, then decoded in Latin-1, it
>> works, and looks barely any more stupid than the original:
>>
>> Max Steele strained his mighty thews against his bonds, but the
>> ¤-rays had left him as weak as a kitten. The evil Galactic Emperor,
>> Giµx-ÍÄin The Terrible of the planet Îe¶´, laughed: "I have you
>> now, Steele, and by this time tomorrow my armies will have overrun
>> your pitiful Earth defences!"
>>
>> but it clearly isn't the original text.
>
> Please note that I wrote "almost always", not "always". It is of course
> possible to construct contrived examples where it is impossible to find
> the correct encoding, because all encodings lead to equally ludicrous
> results.
Whether the results are ludicrous is not the point; the point is whether
the decoded text is the text the author originally intended.
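To make that concrete, here is a minimal sketch (the string and the codec
pair are purely illustrative) of how "it decodes without an exception" is
a much weaker claim than "it decodes to the right text":

    # A short string made of characters that MacRoman can represent.
    original = "Ω ≠ π"
    raw = original.encode("mac_roman")   # the bytes actually written to disk

    print(raw.decode("mac_roman"))       # 'Ω ≠ π' -- the intended text
    print(raw.decode("latin-1"))         # no exception, but the result is
                                         # not the original text

Latin-1 happily decodes every possible byte, so *any* file "works" with
it; a successful decode tells you nothing about whether the result is
correct.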
What you describe works for the EASY cases: you have a small number of
text files in some human-readable language, the files are all valid
texts in that language, and you have an expert in that language on hand
who can distinguish between valid and invalid decoded texts.
If that applies to your text files, great: you have nothing to fear from
encoding issues! Even if the supplier of the files wouldn't know ASCII
from EBCDIC if it fell on them from a great height, you can probably make
an educated guess at what the encoding is. Wonderful.
But that's not always the case. In the real world, especially now that we
interchange documents from all over the world, it isn't the hard cases
that are contrived. Depending on the type of document (e.g. web pages you
scrape are probably different from emails, which are different from
commercial CSV files...), it is being able to just look at the file and
deduce the correct encoding that is the contrived example.
Depending on where the text is coming from:
- you might not have an expert on hand who can distinguish between
valid and invalid text;
- you might have to process a large number of files (thousands or
millions) automatically, and cannot hand-process those that have
encoding problems (see the detection sketch further below);
- your files might not even be in a single consistent encoding, or
may have Mojibake introduced at some earlier point that you do not
have control over;
- you might not know what language the text is supposed to be in;
- or it might contain isolated words in some unknown language;
e.g. your text might be nearly all ASCII English, except for a word
"Čezare" (if using the Czech Kamenický encoding) or "Çezare" (if
using the Polish Mazovia encoding) or "Äezare" (Mac Roman).
How many languages do you need to check to determine which is
correct? (Hint: all three words are valid.)
- not all encoding problems are as easy to resolve as your earlier
German/Russian example. For example, like Japanese, Russian has a
number of popular but mutually incompatible encodings. Mojibake is a
Japanese term, but the Russians have their own word for it:
krakozyabry (кракозя́бры).
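As another minimal sketch (all the codecs named below ship with Python;
the example word is just for illustration), the same Cyrillic bytes
decode "successfully" under several Russian encodings, while only one of
them returns the intended word:

    original = "кракозябры"             # the Russian word for mojibake
    raw = original.encode("koi8_r")     # bytes as a KOI8-R file would store them

    print(raw.decode("koi8_r"))         # 'кракозябры' -- the intended text
    print(raw.decode("cp1251"))         # no error, but the wrong Cyrillic letters
    print(raw.decode("cp866"))          # no error, mostly box-drawing characters
    print(raw.decode("mac_cyrillic"))   # no error, still wrong

Every one of those calls returns a string; nothing in the decode step
itself tells you which result, if any, is the text the author actually
wrote.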
Dealing with bad data is *hard*.
https://www.safaribooksonline.com/library/view/bad-data-handbook/9781449324957/ch04.html
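For the bulk-processing case above, about the best you can do
automatically is a statistical guess plus a cut-off for human review.
A minimal sketch, assuming the third-party chardet package is installed
(the function name and the threshold are mine, purely illustrative):

    import chardet

    def guess_encoding(path, threshold=0.8):
        """Best-guess encoding for the file at *path*, or None if the guess is weak."""
        with open(path, "rb") as f:
            raw = f.read()
        result = chardet.detect(raw)    # e.g. {'encoding': 'KOI8-R', 'confidence': 0.87, ...}
        if result["encoding"] and result["confidence"] >= threshold:
            return result["encoding"]
        return None                     # too uncertain: queue the file for a human

And even a high-confidence guess is still only a guess: no detector can
tell apart the Kamenický, Mazovia and MacRoman readings in the example
above, which is exactly the point.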
--
Steven D'Aprano
"Ever since I learned about confirmation bias, I've been seeing
it everywhere." -- Jon Ronson