Re: [Python-Dev] deleting setdefaultencoding iin site.py is evil

27 Aug 2009

      Martin v. Löwis wrote:
...
...
...
The ability to change the default encoding is a misfeature.  There's
essentially no way to write correct Python code in the presence of
this feature.
How so? If every single piece of text in your project is encoded in a
superset of ascii (such as utf-8), why would this be a problem?
I guess I should have said "every single piece of text in your project 
is encoded in a superset of ascii (such as utf-8) or is decoded into a 
unicode object at the application boundaries, such as an incoming http 
request or in the process of parsing a file off disk", in which case:
...
What is "every single piece of text"? Every string occurring in source
code?
Yes.
...
or also every single string that may be read from a file,
Yes.
...
a socket,
Yes.
...
out of a database,
Yes.
...
or from a user interface?
Yes.

Any others I can say Yes to? ;-)
...
How can you be certain that any string is UTF-8 when doing any
reasonable IO?
Careful checking, and making sure everyone working on the app's 
development knows that anything else will result in severe pain, both 
physical and mental ;-)
...
...
Even if you were evil/stupid and mixed encodings, surely all you'd get
is different unicode errors or maybe the odd strange character during
display?
One specific problem is dictionaries will stop working correctly if you
set the default encoding to anything but ASCII.
...except they haven't.
...
The reason is that
with UTF-8 as the default encoding, you get
py> u"\u20ac" == u"\u20ac".encode("utf-8")
True
py> hash(u"\u20ac") == hash(u"\u20ac".encode("utf-8"))
False
So objects that compare equal will not hash equal. As a consequence, you
may have two different values for what should be the same key in a
dictionary.
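
To make the dictionary consequence concrete, here is a rough sketch. It is
written for Python 3, which has no setdefaultencoding, so the Python 2
behaviour is only imitated: `LooseText` is a hypothetical class invented for
this illustration, with equality that looks through the UTF-8 encoding and a
hash that does not.

```python
# Python 3 sketch; Python 2's str/unicode comparison is imitated, not reproduced.
class LooseText:
    """Hypothetical stand-in for text that compares equal to its UTF-8 bytes."""

    def __init__(self, value):
        self.value = value  # either str or bytes

    def _as_text(self):
        # Decode bytes as UTF-8, mimicking an implicit default-encoding decode.
        if isinstance(self.value, bytes):
            return self.value.decode("utf-8")
        return self.value

    def __eq__(self, other):
        return isinstance(other, LooseText) and self._as_text() == other._as_text()

    def __hash__(self):
        # Hashes the raw value, so two equal objects can hash differently.
        return hash(self.value)


text_key = LooseText("\u20ac")
bytes_key = LooseText("\u20ac".encode("utf-8"))

table = {text_key: "stored under the text key"}
table[bytes_key] = "stored under the bytes key"

print(text_key == bytes_key)  # the two keys compare equal
print(len(table))             # yet the dict holds a separate entry for each
```

The two keys are "the same" as far as == is concerned, but the dictionary
keeps both, which is the failure described above.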
Indeed, but this doesn't happen because the app never has a situation 
where strings and unicodes are put in the same dict. However, it does 
have plenty of situations where lists containing a mixture of utf-8 
encoded strings and unicodes exist, where changing the default encoding 
removes a *lot* of pain.
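
For comparison, one way to cope with such mixed lists without changing the
default encoding is to decode explicitly where the values are used. This is
only a sketch in Python 3 terms (bytes and str rather than str and unicode);
`to_text` is a hypothetical helper, not something taken from the app being
discussed, and it assumes every byte string really is UTF-8.

```python
def to_text(value, encoding="utf-8"):
    """Return value as str, decoding bytes explicitly (hypothetical helper)."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# A list mixing text with UTF-8 encoded bytes, as described above.
mixed = ["price: ", "\u20ac".encode("utf-8"), "10"]

# Normalise each item before joining instead of relying on implicit coercion.
joined = "".join(to_text(item) for item in mixed)
print(joined)
```

The cost is a call at each place the list is consumed, which is the pain
that changing the default encoding was hiding.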
...
It has worked in your application. See my example above: it is very easy
to create applications that stop working correctly if you use
setdefaultencoding (at all - the only supported value is "latin-1",
since Unicode strings hash the same as byte strings if all characters
are in row 0).
Would anyone object if I added this snippet to the .rst that generates:
http://docs.python.org/library/sys.html

It doesn't seem to be recorded anywhere that someone likely to use 
setdefaultencoding would find it...

Chris

-- 
Simplistix - Content Management, Batch Processing & Python Consulting
            - http://www.simplistix.co.uk

Chris Withers