Unicode

Dave Angel d at davea.name
Mon Dec 17 22:09:04 CET 2012


On 12/17/2012 03:00 PM, Anatoli Hristov wrote:
>> I fixed the print, I changed the setting of the terminal and also on
>> the sshconfig, so now when I print I'm able to print out without
>> problems, but when I tried to run the script I've made it gives me
>> again the same error :
>> ""Unexpected error: exceptions.UnicodeEncodeError
>> """
That's not the whole error message. What encoding does it report in the
error?

>> Maybe I will try to update to 2.7

> Upgraded to python 27 and still it gives Unexpected error:
> exceptions.UnicodeEncodeError. Damn encoders I don'y know what to
> do...

I doubted that 2.7 would make any difference.

1. What does your "terminal" expect? (For all I know you're using
TeraTermPro as a terminal, which doesn't support utf-8.)
Have you looked at the terminal encoding to see what your copy of
Terminal is expecting? On my Ubuntu Linux, I open the terminal with
Ctrl-Alt-t, then in the menu bar, I select
Terminal->SetCharacterEncoding->utf-8

2. What does your environment tell Linux to support? At a bash prompt, try
echo $LANG
(I've seen references to two other environment variables as well, so
this aspect is nuts.)

Mine says
en_US.UTF-8

3. What does Python think it was told?
import sys
print sys.stdout.encoding

Mine says
UTF-8
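
The three checks above can be gathered into one small script. This is only a sketch: the function name encoding_report is made up for illustration, and I'm assuming the "two other environment variables" are LC_ALL and LC_CTYPE, which the C library consults before LANG. It is written to run unchanged on Python 2.7 and Python 3.

```python
import locale
import os
import sys

def encoding_report():
    """Collect the settings that influence how Python encodes printed text."""
    report = {}
    # Locale variables, highest priority first (assumed to be the relevant ones).
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        report[name] = os.environ.get(name)
    # What Python chose for standard output; can be None when output is piped.
    report["sys.stdout.encoding"] = getattr(sys.stdout, "encoding", None)
    # The encoding the locale machinery reports as preferred.
    report["locale.getpreferredencoding()"] = locale.getpreferredencoding()
    return report

for key, value in sorted(encoding_report().items()):
    print("%s = %r" % (key, value))
```

If the terminal setting, the locale variables and sys.stdout.encoding disagree with each other, that mismatch is the first thing to fix.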


I can force a similar error as follows:


import urllib
opener = urllib.FancyURLopener({})
ffr = opener.open(
    "http://prf.icecat.biz/index.cgi?product_id=%s;mi=start;smi=product;shopname=openICEcat-url;lang=fr"
    % (14688538))
src = ffr.read()

out = src.decode("utf-8").encode("latin-1")

Traceback (most recent call last):
  File "anatoli3.py", line 9, in <module>
    src.decode("utf-8").encode("latin-1")
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2122' in position 17167: ordinal not in range(256)


And from that it's quite clear that for that particular data, I cannot
use a latin-1 encoder.

So I did a bit of hunting, and found that the offending character is the
one after the word "Core" in the following quote:

processeurs Intel® Core™ de 3ème génération
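
The hunting can be automated rather than done by eye. A minimal sketch, assuming you already have the decoded unicode text; the helper name unencodable is hypothetical, and the sample string is the phrase quoted above. It runs on Python 2.7 and Python 3.

```python
def unencodable(text, encoding):
    """Return (position, character) pairs that `encoding` cannot represent."""
    bad = []
    for pos, ch in enumerate(text):
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            bad.append((pos, ch))
    return bad

sample = u'processeurs Intel\xae Core\u2122 de 3\xe8me g\xe9n\xe9ration av'
for pos, ch in unencodable(sample, "latin-1"):
    # Print the code point in ASCII so this line cannot itself fail to encode.
    print("%d U+%04X" % (pos, ord(ch)))   # prints: 23 U+2122
```

Note that the registered sign (U+00AE) and the accented letters are not reported, because latin-1 does include them; only the trademark sign is.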


The symbol is a trademark symbol and is not part of latin-1. If you're
really stuck with a latin-1 terminal, then you could do something like:

print src.decode("utf-8").encode("latin-1", "ignore")

That says to decode it using utf-8 (because the html declared a utf-8
encoding), encode the result as latin-1 (because your terminal is stuck
there), silently dropping any character latin-1 can't represent, then
print.


Just realize that once you start using 'ignore' you're also going to
ignore discrepancies that are real. For example, maybe your terminal is
actually something other than either latin-1 or utf-8.
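
If dropping characters silently is a worry, the other standard error handlers at least leave a visible trace. A short sketch comparing them on a fragment of the same phrase; the variable names are just for illustration, and it runs on Python 2.7 and Python 3 (the repr of the byte strings differs slightly between the two).

```python
text = u'Intel\xae Core\u2122'

results = {}
for handler in ("ignore", "replace", "xmlcharrefreplace", "backslashreplace"):
    # Each handler decides what to emit for U+2122, which latin-1 lacks.
    results[handler] = text.encode("latin-1", handler)
    print("%-18s %r" % (handler, results[handler]))
```

'replace' puts a '?' where the trademark sign was, 'xmlcharrefreplace' emits the reference &#8482; (which may suit HTML output), and 'backslashreplace' emits the escape \u2122. Which one fits depends on where the output ends up.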


For others who just want to play with a minimal subset:


test = u'processeurs Intel\xae Core\u2122 de 3\xe8me g\xe9n\xe9ration av'
print test
print test.encode("latin-1", "ignore")
print test.encode("latin-1")

produces:

processeurs Intel® Core™ de 3ème génération av
processeurs Intel� Core de 3�me g�n�ration av
Traceback (most recent call last):
  File "anatoli3.py", line 22, in <module>
    print test.encode("latin-1")
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2122' in position 23: ordinal not in range(256)
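
The replacement characters in the second output line are expected on a utf-8 terminal like mine: the latin-1 bytes for the registered sign and the accented letters are not valid utf-8 sequences, so the terminal shows a placeholder for each. A rough way to reproduce that in Python, assuming the terminal behaves like a utf-8 decoder that substitutes U+FFFD for bad bytes (runs on Python 2.7 and Python 3):

```python
test = u'processeurs Intel\xae Core\u2122 de 3\xe8me g\xe9n\xe9ration av'

# The bytes that the second print statement sends to the terminal.
latin1_bytes = test.encode("latin-1", "ignore")

# Interpret those bytes as utf-8, marking each invalid byte with U+FFFD.
as_seen = latin1_bytes.decode("utf-8", "replace")
print(repr(as_seen))
```

The result has four U+FFFD characters, one for each of the four non-ASCII latin-1 bytes, and no trace of the trademark sign, which 'ignore' had already dropped.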




-- 

DaveA



