Martin v. Löwis wrote:
The ability to change the default encoding is a misfeature. There's essentially no way to write correct Python code in the presence of this feature.
How so? If every single piece of text in your project is encoded in a superset of ascii (such as utf-8), why would this be a problem?
I guess I should have said "every single piece of text in your project is encoded in a superset of ascii (such as utf-8) or is decoded into a unicode object at the application boundaries, such as an incoming http request or in the process of parsing a file off disk", in which case:
What is "every single piece of text"? Every string occurring in source code?
Yes.
or also every single string that may be read from a file,
Yes.
a socket,
Yes.
out of a database,
Yes.
or from a user interface?
Yes. Any others I can say Yes to? ;-)
How can you be certain that any string is UTF-8 when doing any reasonable IO?
Careful checking, and an understanding among the people working on the app's development that anything else will result in severe pain, both physical and mental ;-)
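For what it's worth, here is a minimal sketch of what that kind of checking at a boundary could look like. It is written for Python 3, where bytes and text are separate types, and it is an assumption for illustration only: `to_text` and its `source` parameter are made-up names, not anything from the app being discussed.

```python
def to_text(raw, source="input"):
    """Decode bytes arriving at an application boundary as strict UTF-8.

    Hypothetical helper: text passes through untouched, bytes must be
    valid UTF-8, and anything else fails loudly at the boundary.
    """
    if isinstance(raw, str):
        return raw
    try:
        # The default "strict" error handler rejects invalid byte sequences.
        return raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        raise ValueError("%s is not valid UTF-8: %s" % (source, exc))


# The euro sign as UTF-8 bytes, e.g. read from a socket or a file.
euro = to_text(b"\xe2\x82\xac", source="http body")
```

The point of failing here rather than later is that a badly encoded string is rejected where it enters the app, instead of surfacing as a unicode error somewhere unrelated.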
Even if you were evil/stupid and mixed encodings, surely all you'd get is different unicode errors or maybe the odd strange character during display?
One specific problem is dictionaries will stop working correctly if you set the default encoding to anything but ASCII.
...except they haven't.
The reason is that with UTF-8 as the default encoding, you get
py> u"\u20ac" == u"\u20ac".encode("utf-8")
True
py> hash(u"\u20ac") == hash(u"\u20ac".encode("utf-8"))
False
So objects that compare equal will not hash equal. As a consequence, you may have two different values for what should be the same key in a dictionary.
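The same effect can be sketched without touching the default encoding at all. The following is a Python 3 analogue under stated assumptions, not the Python 2 behaviour itself: `LooseKey` is a hypothetical class that compares equal to its UTF-8 encoded bytes but hashes like its text, which is the same mismatch as in the session above.

```python
class LooseKey:
    """Hypothetical key: equal to its UTF-8 bytes, but hashed as text."""

    def __init__(self, text):
        self.text = text

    def __eq__(self, other):
        if isinstance(other, bytes):
            return self.text == other.decode("utf-8")
        if isinstance(other, LooseKey):
            return self.text == other.text
        return NotImplemented

    def __hash__(self):
        # Hash of the text, which differs from the hash of the encoded bytes.
        return hash(self.text)


key = LooseKey("\u20ac")
raw = "\u20ac".encode("utf-8")

d = {}
d[key] = "set via text key"
d[raw] = "set via byte key"  # lands in a separate entry despite key == raw
```

Here `key == raw` holds, yet the dict ends up with two entries, because lookup compares hashes before it ever calls `__eq__`.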
Indeed, but this doesn't happen because the app never has a situation where strings and unicodes are put in the same dict. However, it does have plenty of situations where lists containing a mixture of utf-8 encoded strings and unicodes exist, where changing the default encoding removes a *lot* of pain.
It has worked in your application. See my example above: it is very easy to create applications that stop working correctly if you use setdefaultencoding (at all - the only supported value is "latin-1", since Unicode strings hash the same as byte strings if all characters are in row 0).
Would anyone object if I added this snippet to the .rst that generates: http://docs.python.org/library/sys.html
It doesn't seem to be recorded anywhere that anyone likely to use setdefaultencoding is likely to find it...

Chris

--
Simplistix - Content Management, Batch Processing & Python Consulting - http://www.simplistix.co.uk