[Tutor] UnicodeEncodeError: 'cp932' codec can't encode character '\xe9' in position

Sun Mar 11 02:03:18 CET 2012

On 03/10/2012 06:38 PM, Robert Sjoblom wrote:
> Okay, so here's a fun one. Since I'm on a Japanese locale, my native
> encoding is cp932. I was thinking of writing a parser for a bunch of
> text files, but I stumbled on even printing the contents due to ...
> something. I don't know what encoding the text file uses, which isn't
> helping my case either (I have asked, but I've yet to get an answer).
>
> Okay, so:
>
> address = "C:/Path/to/file/file.ext"
> with open(address, encoding="cp1252") as alpha:
>      text = alpha.readlines()
>      for line in text:
>          print(line)
>
> It starts to print until it hits the wonderful character é or '\xe9',
> where it gives me this happy traceback:
> Traceback (most recent call last):
>    File "C:\Users\Azaz\Desktop\CK2 Map Painter\Parser\test parser.py",
> line 8, in <module>
>      print(line)
> UnicodeEncodeError: 'cp932' codec can't encode character '\xe9' in
> position 13: illegal multibyte sequence
>
> I can open the document and view it in UltraEdit -- and it displays
> correct characters there -- but UE can't give me what encoding it
> uses. Any chance of solving this without having to switch from my
> Japanese locale? Also, the cp1252 is just an educated guess, but it
> doesn't really matter because it always comes back to the cp932 error.
>

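cp1252 is a single-byte encoding with at most 256 characters, while
cp932 is a multibyte encoding aimed mainly at Japanese text.  The two
character sets only partly overlap, so you should expect to see this
error if your input file is unconstrained.  And since you don't know
what encoding it's in, you might as well consider it unconstrained.

In other words, there are characters in cp1252, such as é, that simply
have no representation in cp932.  The file is being read fine; it is
print() that fails when it tries to encode the line for your console.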
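
A quick way to see the mismatch for yourself, as a minimal sketch: it
takes the character from the traceback and tries to encode it with both
codecs. The variable names are only for illustration.

```python
# The character from the traceback: é (U+00E9).
ch = "\xe9"

# cp1252 has a slot for it: the single byte 0xE9.
as_cp1252 = ch.encode("cp1252")

# cp932 has no code for it, so a strict encode raises UnicodeEncodeError,
# which is the same failure print() hits on a cp932 console.
try:
    ch.encode("cp932")
    encodable_in_cp932 = True
except UnicodeEncodeError:
    encodable_in_cp932 = False

print(as_cp1252, encodable_in_cp932)
```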

You can "solve" the problem by pretending the input file is also cp932 
when you open it. That way you'll get the wrong characters, but no 
encoding errors on output (the read itself can still fail on bytes that 
aren't valid cp932).  Or you can solve it by encoding the output explicitly, telling 
it to ignore errors.  I don't know how to do that in Python 3.x.  
Finally, you can change your console to be utf-8, and find a font that 
includes both sets of characters.
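
For the second option, one way it could look in Python 3 is sketched
below. The helper name console_safe is made up for this example, and
defaulting to cp932 is an assumption taken from this thread. The idea is
to encode with an error handler and decode straight back, so that
anything the console can't show is replaced before print() sees it.

```python
import sys

def console_safe(text, encoding=None, errors="replace"):
    """Return text with characters the target encoding lacks swapped out."""
    # Use the console's own encoding if known, else cp932 as in this thread.
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "cp932"
    # Round-trip: unencodable characters are handled by the error handler.
    return text.encode(encoding, errors=errors).decode(encoding)

sample = "caf\xe9"
print(console_safe(sample, "cp932"))                      # é becomes ?
print(console_safe(sample, "cp932", "backslashreplace"))  # é becomes \xe9
```

In the original loop that would be print(console_safe(line)). Passing
errors="ignore" drops the character instead of replacing it. On Python
3.7 and later, sys.stdout.reconfigure(errors="replace") should give a
similar effect for every print() without a helper.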

-- 

DaveA