[Tutor] UnicodeEncodeError: 'cp932' codec can't encode character '\xe9' in position
Steven D'Aprano
steve at pearwood.info
Sun Mar 11 10:38:09 CET 2012
On Sat, Mar 10, 2012 at 08:03:18PM -0500, Dave Angel wrote:
> There are just 256 possible characters in cp1252, and 256 in cp932.
CP932 is also known as MS-KANJI or SHIFT-JIS (actually, one of many
variants of SHIFT-JIS). It is a multi-byte encoding, which means it has
far more than 256 characters.
http://www.rikai.com/library/kanjitables/kanji_codes.sjis.shtml
http://en.wikipedia.org/wiki/Shift_JIS
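A minimal sketch of the multi-byte point, assuming Python 3. The helper
name encoded_length and the sample characters are my own choices for
illustration, not anything from the OP's code:

```python
def encoded_length(char, encoding="cp932"):
    """Return how many bytes `char` occupies once encoded (hypothetical helper)."""
    return len(char.encode(encoding))

# An ASCII letter fits in a single byte...
print(encoded_length("A"))
# ...but a kanji such as U+65E5 (the "sun/day" character) needs two.
print(encoded_length("\u65e5"))
```

Since many characters take a two-byte sequence, the encoding is not
limited to 256 code points.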
The actual problem the OP has got is that the *multi-byte* sequence he
is trying to print is illegal when interpreted as CP932. Personally I
think that's a bug in the terminal, or possibly even print, since he's
not printing bytes but characters, but I haven't given that a lot of
thought so I might be way out of line.
The quick and dirty fix is to change the encoding of his terminal, so
that it no longer tries to interpret the characters printed using CP932.
That will also mean he'll no longer see valid Japanese characters.
But since he appears to be using Windows, I don't know if this is
possible, or easy.
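If changing the terminal encoding turns out to be impractical, a
different workaround (not the fix described above, just a sketch under
the assumption of Python 3) is to substitute any character the terminal
encoding cannot represent before printing. The function name
replace_unencodable is hypothetical:

```python
def replace_unencodable(text, encoding="cp932"):
    """Round-trip `text` through `encoding`, turning unencodable characters into '?'."""
    return text.encode(encoding, errors="replace").decode(encoding)

# '\xe9' (e with acute accent) has no CP932 representation, so it becomes '?'.
print(replace_unencodable("caf\xe9"))
```

This loses information, of course, but it keeps valid Japanese text
intact and avoids the exception.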
[...]
> You can "solve" the problem by pretending the input file is also cp932
> when you open it. That way you'll get the wrong characters, but no
> errors.
Not so -- there are multi-byte sequences that can't be read in CP932.
>>> b"\xe9x".decode("cp932") # this one works
'騙'
>>> b"\xe9!".decode("cp932") # this one doesn't
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'cp932' codec can't decode bytes in position 0-1: illegal multibyte sequence
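To check whether a particular byte string is legal CP932 without
crashing, one could wrap the decode in a try block. This is only a
sketch, assuming Python 3; try_decode is a made-up helper name:

```python
def try_decode(data, encoding="cp932"):
    """Return the decoded text, or None if `data` is not valid in `encoding`."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None

# 0x21 ('!') is not a valid trail byte after the lead byte 0xE9.
print(try_decode(b"\xe9!"))
# 0x93 0xFA is a valid two-byte sequence; ascii() avoids depending on
# what the terminal itself can display.
print(ascii(try_decode(b"\x93\xfa")))
```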
In any case, the error doesn't occur when he reads the data, but when he
prints it. Once the data is read, it is already Unicode text, so he
should be able to print any character. At worst, it will print as a
missing character (a square box or space) rather than the expected
glyph. He shouldn't get a UnicodeDecodeError when printing. I smell a
bug since print shouldn't be decoding anything. (At worst, it needs to
*encode*.)
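The encode step that print performs can be observed without touching
the real terminal. The sketch below assumes Python 3 and stands in for
the terminal with an in-memory stream that uses CP932; the function
name print_to_fake_terminal is hypothetical:

```python
import io

def print_to_fake_terminal(text, encoding="cp932"):
    """Print `text` to an in-memory stream that encodes like a CP932 terminal.

    Returns the bytes written, or the exception class name if encoding fails.
    """
    raw = io.BytesIO()
    # newline="\n" disables platform-specific newline translation.
    stream = io.TextIOWrapper(raw, encoding=encoding, newline="\n")
    try:
        print(text, file=stream)
        stream.flush()
    except UnicodeError as exc:
        return type(exc).__name__
    return raw.getvalue()

# A kanji is encoded to two bytes plus the newline.
print(print_to_fake_terminal("\u65e5"))
# '\xe9' cannot be encoded, so the failure is an *encode* error.
print(print_to_fake_terminal("\xe9"))
```

In this simulation the failure is raised while encoding text to bytes;
no decoding is involved at print time.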
--
Steven