[Python-Dev] deleting setdefaultencoding in site.py is evil

Chris Withers chris at simplistix.co.uk
Thu Aug 27 09:42:51 CEST 2009


Martin v. Löwis wrote:
>>> The ability to change the default encoding is a misfeature.  There's
>>> essentially no way to write correct Python code in the presence of
>>> this feature.
>> How so? If every single piece of text in your project is encoded in a
>> superset of ascii (such as utf-8), why would this be a problem?

I guess I should have said "every single piece of text in your project 
is encoded in a superset of ascii (such as utf-8) or is decoded into a 
unicode object at the application boundaries, such as an incoming http 
request or in the process of parsing a file off disk", in which case:

> What is "every single piece of text"? Every string occurring in source
> code? 

Yes.

> or also every single string that may be read from a file,

Yes.

> a socket,

Yes.

> out of a database, 

Yes.

> or from a user interface?

Yes.

Any others I can say Yes to? ;-)

> How can you be certain that any string is UTF-8 when doing any
> reasonable IO?

Careful checking, and making sure everyone working on the app's 
development knows that anything else will result in severe pain, both 
physical and mental ;-)

>> Even if you were evil/stupid and mixed encodings, surely all you'd get
>> is different unicode errors or maybe the odd strange character during
>> display?
> 
> One specific problem is dictionaries will stop working correctly if you
> set the default encoding to anything but ASCII. 

...except they haven't.

> The reason is that
> with UTF-8 as the default encoding, you get
> 
> py> u"\u20ac" == u"\u20ac".encode("utf-8")
> True
> py> hash(u"\u20ac") == hash(u"\u20ac".encode("utf-8"))
> False
> 
> So objects that compare equal will not hash equal. As a consequence, you
> may have two different values for what should be the same key in a
> dictionary.
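
For anyone who wants to poke at the failure mode being described, here is a
minimal sketch. It is a model, not the real thing: LooseText is a
hypothetical class made up for illustration, and the code is written for
Python 3, where str and bytes never compare equal on their own. The class
fakes the "implicit UTF-8 decode on comparison" behaviour while still
hashing the text rather than the bytes.

```python
class LooseText:
    """Hypothetical text type that compares equal to its UTF-8 bytes."""

    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        if isinstance(other, LooseText):
            return self.value == other.value
        if isinstance(other, bytes):
            # Implicit decode, standing in for a UTF-8 default encoding.
            try:
                return self.value == other.decode("utf-8")
            except UnicodeDecodeError:
                return False
        return NotImplemented

    def __hash__(self):
        # Hashes the text, not its UTF-8 bytes, so it disagrees with hash(bytes).
        return hash(("text", self.value))


text = LooseText("\u20ac")
raw = "\u20ac".encode("utf-8")

prices = {text: "as text"}
# The keys compare equal, but the hashes differ, so this adds a second entry.
prices[raw] = "as bytes"
```

The dict only calls __eq__ on keys whose hashes already match, so the two
"equal" keys end up as separate entries.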

Indeed, but this doesn't happen because the app never has a situation 
where strings and unicodes are put in the same dict. However, it does 
have plenty of situations where lists containing a mixture of utf-8 
encoded strings and unicodes exist, where changing the default encoding 
removes a *lot* of pain.
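
For comparison, the alternative to changing the default encoding for such
mixed lists is to decode explicitly before the items are used together. A
rough sketch, assuming every byte string really is utf-8; to_text is a
hypothetical helper, not something the app has, and the code is written for
Python 3, where byte strings are the bytes type:

```python
def to_text(value, encoding="utf-8"):
    """Return value as text, decoding it first if it is a byte string."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# A list mixing utf-8 encoded byte strings and text.
mixed = [b"caf\xc3\xa9", "caf\u00e9", b"\xe2\x82\xac"]

# Normalise once, then everything downstream sees only text.
texts = [to_text(item) for item in mixed]
joined = ", ".join(texts)
```

The cost is that every place building such a list has to remember to call
the helper, which is exactly the pain mentioned above.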

> It has worked in your application. See my example above: it is very easy
> to create applications that stop working correctly if you use
> setdefaultencoding (at all - the only supported value is "latin-1",
> since Unicode strings hash the same as byte strings if all characters
> are in row 0).

Would anyone object if I added this snippet to the .rst that generates
http://docs.python.org/library/sys.html?

It doesn't seem to be recorded anywhere that someone likely to use 
setdefaultencoding would find it...

Chris

-- 
Simplistix - Content Management, Batch Processing & Python Consulting
            - http://www.simplistix.co.uk

