Unicode again ... default codec ...

Fri Oct 30 18:33:41 EDT 2009

On Fri, 30 Oct 2009 13:40:14 -0300, zooko <zookog at gmail.com> wrote:
> On Oct 20, 9:50 pm, "Gabriel Genellina" <gagsl-... at yahoo.com.ar>
> wrote:
>
>> DON'T do that. Really. Changing the default encoding is a horrible,
>> horrible hack and causes a lot of problems.
>
> I'm not convinced.  I've read all of the posts and web pages and blog
> entries decrying this practice over the last several years, but as far
> as I can tell the actual harm that can result is limited (as long as
> you set it to utf-8) and the practical benefits are substantial.  This
> is a pattern that I have no problem using:
>
> import sys
> reload(sys)
> sys.setdefaultencoding("utf-8")
>
> The reason this doesn't cause too much harm is that anything that
> would have worked with the original default encoding ('ascii') will
> also work with the new utf-8 default encoding.

Wrong. Dictionaries may start behaving incorrectly, for example. Normally,
two keys that compare equal cannot coexist in the same dictionary:

py> 1 == 1.0
True
py> d = {}
py> d[1] = '*'
py> d[1.0]
'*'
py> d[1.0] = '$'
py> d
{1: '$'}

1 and 1.0 are the same key, as far as the dictionary is concerned. For  
this to work, both keys must have the same hash:

py> hash(1) == hash(1.0)
True

Now, let's set the default encoding to utf-8:

py> import sys
py> reload(sys)
<module 'sys' (built-in)>
py> sys.setdefaultencoding('utf-8')
py> x = u'á'
py> y = u'á'.encode('utf-8')
py> x
u'\xe1'
py> y
'\xc3\xa1'

(same as y='á' if the source encoding is set to utf-8, but I don't want to  
depend on that). Just to be sure we're dealing with the right character:

py> import unicodedata
py> unicodedata.name(x)
'LATIN SMALL LETTER A WITH ACUTE'
py> unicodedata.name(y.decode('utf-8'))
'LATIN SMALL LETTER A WITH ACUTE'

Now, we can see that both x and y are equal:

py> x == y
True

x is an accented a, and y is the same character encoded using the default
encoding (now utf-8); they compare equal. Fine. Now create a dictionary:

py> d = {}
py> d[x] = '*'
py> d[x]
'*'
py> x in d
True
py> y in d
False            # ???
py> d[y] = 2
py> d
{u'\xe1': '*', '\xc3\xa1': 2} # ????

Since x == y, one would expect a single entry in the dictionary, but we
got two. That's because:

py> x == y
True
py> hash(x) == hash(y)
False

and this must *not* happen, according to
http://docs.python.org/reference/datamodel.html#object.__hash__
"The only required property is that objects which compare equal have the
same hash value"
Considering that dictionaries in Python are used almost everywhere,
breaking this basic assumption is a really serious problem.
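
The same failure can be reproduced without touching the default encoding
at all. What follows is only a sketch of mine, not part of the session
above: the class Key and its salt attribute are made up for illustration,
and the lookup results in the comments assume CPython, which compares the
stored hash before it ever calls __eq__.

```python
class Key(object):
    """Hypothetical key type that deliberately breaks the hash/eq contract."""

    def __init__(self, text, salt):
        self.text = text
        self.salt = salt

    def __eq__(self, other):
        # Equality looks only at the text...
        return isinstance(other, Key) and self.text == other.text

    def __ne__(self, other):
        return not self == other

    def __hash__(self):
        # ...but the hash depends on the salt, so equal keys may hash differently.
        return self.salt


x = Key("a", 1)
y = Key("a", 2)

equal = (x == y)                    # the keys compare equal
same_hash = (hash(x) == hash(y))    # yet their hashes differ

d = {x: "*"}
y_found = y in d                    # lookup goes by hash first, so y is missed
d[y] = 2
entries = len(d)                    # two entries for "equal" keys
```

This is exactly the situation x and y are in once the default encoding is
utf-8: equal according to ==, different according to hash().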

Of course, all of this applies to Python 2.x; in Python 3.0 the problem
was solved differently: strings are unicode by default, byte strings are a
separate type that never compares equal to them, and the default
encoding IS utf-8.
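
For comparison, here is roughly how the same experiment goes on Python 3.
Again this is an illustrative sketch of mine rather than a transcript; the
variable names are arbitrary, and it assumes the interpreter is run without
the -b/-bb options (which turn str/bytes comparisons into warnings or
errors).

```python
import sys

x = "\u00e1"               # LATIN SMALL LETTER A WITH ACUTE, as text (str)
y = x.encode("utf-8")      # the same character as UTF-8 bytes

default_encoding = sys.getdefaultencoding()
can_change_default = hasattr(sys, "setdefaultencoding")

# str and bytes never compare equal, so having different hashes is harmless.
equal = (x == y)

d = {x: "*"}
d[y] = 2
entries = len(d)           # two entries, but now for two genuinely unequal keys
```

Here the dictionary still ends up with two entries, but that is consistent:
the keys are not equal, so nothing in the hash contract is violated.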

> As far as I've seen
> from the aforementioned mailing list threads and blog posts and so on,
> the worst thing that has ever happened as a result of this technique
> is that something works for you but fails for someone else who doesn't
> have this stanza.  (http://tarekziade.wordpress.com/2008/01/08/
> syssetdefaultencoding-is-evil/ .)  That's bad, but probably just
> including this stanza at the top of the file that you are sharing with
> that other person instead of doing it in a sitecustomize.py file will
> avoid that problem.

And then you break all other libraries that the program is using,
including the Python standard library, because the default encoding is a
global setting. What if another library decides to use latin-1 as the
default encoding, using the same trick? The last one wins...

You said "the practical benefits are substantial" but I, for one,
cannot see any benefit. Perhaps if you post your real problems, someone
can find the solution.

The right way is to fix your program to do the right thing, not to sweep
the bugs under the rug.
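
One common way of "doing the right thing" is to decode byte strings
explicitly at the point where they enter the program, and to work with text
everywhere else. The sketch below is only an assumption about what such a
fix could look like: the helper name to_text is hypothetical, and the code
is written for Python 3 (on 2.x the same idea applies, with unicode and str
in place of str and bytes).

```python
def to_text(value, encoding="utf-8"):
    """Hypothetical helper: return text, decoding bytes with an explicit encoding."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


raw = b"\xc3\xa1"          # bytes as they might arrive from a file or a socket
text = "\u00e1"            # the same character, already as text

d = {}
d[to_text(text)] = "*"
d[to_text(raw)] = 2        # same key after decoding, so this overwrites "*"

entries = len(d)
value = d[text]
```

The encoding is named where the bytes are decoded, so nothing depends on a
global setting that some other library might change.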

-- 
Gabriel Genellina