Unicode again ... default codec ...

Gabriel Genellina gagsl-py2 at yahoo.com.ar
Thu Oct 22 02:52:39 CEST 2009


On Wed, 21 Oct 2009 06:24:55 -0300, Lele Gaifax <lele at metapensiero.it>
wrote:

> "Gabriel Genellina" <gagsl-py2 at yahoo.com.ar> writes:
>
>> DON'T do that. Really. Changing the default encoding is a horrible,
>> horrible hack and causes a lot of problems.
>> ...
>> More reasons:
>> http://tarekziade.wordpress.com/2008/01/08/syssetdefaultencoding-is-evil/
>> See also this recent thread in python-dev:
>> http://comments.gmane.org/gmane.comp.python.devel/106134
>
> This is a problem that appears quite often, against which I have yet to
> see a general workaround, or even a "safe pattern". I must confess that
> most often I just give up and change the "if 0:" line in
> sitecustomize.py to enable a reasonable default...
>
> A week ago I met another incarnation of the problem that I finally
> solved by reloading the sys module, a very ugly way, don't tell me, and
> I really would like to know a better way of doing it.
>
> The case is simple enough: a unit test started failing miserably, with a
> really strange traceback, and a quick pdb session revealed that the
> culprit was nosetest, when it prints out the name of the test, using
> some variant of "print testfunc.__doc__": since the latter happened to
> be a unicode string containing some accented letters, that piece of
> nosetest's code raised an encoding error, that went untrapped...
>
> I tried to understand the issue, until I found that I was inside a fresh
> new virtualenv with python 2.6 and the sitecustomize wasn't even
> there. So, even if my shell environment was UTF-8 (the system being an Ubuntu
> Jaunty), within that virtualenv Python's stdout encoding was
> 'ascii'. Rightly so, nosetest failed to encode the accented letters to
> that.

That seems to imply that in your "normal" environment you altered the
default encoding to utf-8 -- if so: don't do that!

> I could just rephrase the test __doc__, or remove it, but to avoid
> future noise I decided to go with the deprecated "reload(sys)" trick,
> done as early as possible... damn, it's just a test suite after all!
>
> Is there a "correct" way of dealing with this? What should nosetest
> eventually do to initialize its sys.stdout.encoding reflecting the
> system's settings? And on the user side, how could I otherwise fix it (I
> mean, without resorting to the reload())?

nosetest should do nothing special. You should configure the environment
so Python *knows* that your console understands utf-8. Once Python is
aware of the *real* encoding your console is using, sys.stdout.encoding
will be utf-8 automatically and your problem is solved. I don't know how
to do that within virtualenv, but the answer certainly does NOT involve
sys.setdefaultencoding().
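
The role of the stream encoding can be reproduced without a real console.
What follows is only a minimal sketch, written in Python 3 syntax (the
sessions in this thread are Python 2.6, where details differ). The helper
`write_text` is a hypothetical name used for illustration; it stands in for
a console stream whose encoding Python has already determined:

```python
import io


def write_text(text, encoding):
    """Write text to an in-memory stream that encodes like a console would.

    write_text is a hypothetical helper, not part of any library.
    """
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding=encoding)
    stream.write(text)
    stream.flush()  # push the encoded bytes into the underlying buffer
    return raw.getvalue()


text = u"\xe1\xf1\xe7"  # three accented letters: a-acute, n-tilde, c-cedilla

# A stream that knows it speaks utf-8 encodes the text without any help.
utf8_bytes = write_text(text, "utf-8")

# A stream that believes it is ascii cannot encode the same text.
try:
    write_text(text, "ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True
```

Nothing about the default encoding is touched here: the only thing that
changes between the working and the failing case is the encoding attached
to the output stream.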

On Windows, a "normal" console window on my system uses cp850:


D:\USERDATA\Gabriel>chcp
Tabla de códigos activa: 850

D:\USERDATA\Gabriel>python
Python 2.6.3 (r263rc1:75186, Oct  2 2009, 20:40:30) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
py> import sys
py> sys.getdefaultencoding()
'ascii'
py> sys.stdout.encoding
'cp850'
py> u = u"áñç"
py> print u
áñç
py> u
u'\xe1\xf1\xe7'
py> u.encode("cp850")
'\xa0\xa4\x87'
py> import unicodedata
py> unicodedata.name(u[0])
'LATIN SMALL LETTER A WITH ACUTE'

I opened another console, changed the code page to 1252 (the one used in
Windows applications; `chcp 1252`) and invoked Python again:

py> import sys
py> sys.getdefaultencoding()
'ascii'
py> sys.stdout.encoding
'cp1252'
py> u = u"áñç"
py> print u
áñç
py> u
u'\xe1\xf1\xe7'
py> u.encode("cp1252")
'\xe1\xf1\xe7'
py> import unicodedata
py> unicodedata.name(u[0])
'LATIN SMALL LETTER A WITH ACUTE'
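
The two sessions differ only in the bytes Python sends to the console. A
small sketch of that point, in Python 3 syntax and with variable names
chosen purely for illustration: the same unicode text maps to different
bytes under each code page, and decoding with the wrong code page silently
yields other characters.

```python
text = u"\xe1\xf1\xe7"  # the same three letters as in the sessions above

as_cp850 = text.encode("cp850")    # bytes a cp850 console expects
as_cp1252 = text.encode("cp1252")  # bytes a cp1252 console expects

# Decoding with the matching code page recovers the original text.
roundtrip_850 = as_cp850.decode("cp850")
roundtrip_1252 = as_cp1252.decode("cp1252")

# Decoding with the wrong code page raises no error here; it just
# produces different characters, which is what a misconfigured
# console would display.
mismatched = as_cp850.decode("cp1252")
```

This is why Python has to know the real console encoding: the unicode
object is identical in both sessions, and only the encoding step differs.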

As you can see, everything works fine without any need to change the
default encoding... Just make sure Python *knows* which encoding is being
used in the console on which it runs. On Ubuntu you may need to set the
LANG environment variable.
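
To check what Python has actually picked up from the environment, something
like the following sketch may help. It is written in Python 3 syntax, and
`console_report` is a hypothetical helper name, not an existing API. It
only reads settings; it does not change any of them. PYTHONIOENCODING is
included because Python (2.6 and later) consults it for the standard
streams.

```python
import locale
import os
import sys


def console_report(environ=None):
    """Collect the settings that influence the console encoding.

    console_report is a hypothetical helper for illustration only.
    """
    if environ is None:
        environ = os.environ
    return {
        # Locale variables usually set by the shell on Unix-like systems.
        "LANG": environ.get("LANG"),
        "LC_ALL": environ.get("LC_ALL"),
        "LC_CTYPE": environ.get("LC_CTYPE"),
        # Explicit override for the standard streams, if any.
        "PYTHONIOENCODING": environ.get("PYTHONIOENCODING"),
        # What the locale machinery reports as the preferred encoding.
        "preferred_encoding": locale.getpreferredencoding(False),
        # What Python will really use when printing to stdout.
        "stdout_encoding": getattr(sys.stdout, "encoding", None),
    }


if __name__ == "__main__":
    for key, value in sorted(console_report().items()):
        print("%s = %r" % (key, value))
```

If `stdout_encoding` comes out as ascii while the terminal really uses
utf-8, the locale variables shown in the report are the first thing to
look at.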

-- 
Gabriel Genellina



