[Python-Dev] deleting setdefaultencoding in site.py is evil
Chris Withers
chris at simplistix.co.uk
Thu Aug 27 09:42:51 CEST 2009
Martin v. Löwis wrote:
>>> The ability to change the default encoding is a misfeature. There's
>>> essentially no way to write correct Python code in the presence of
>>> this feature.
>> How so? If every single piece of text in your project is encoded in a
>> superset of ascii (such as utf-8), why would this be a problem?
I guess I should have said "every single piece of text in your project
is encoded in a superset of ascii (such as utf-8) or is decoded into a
unicode object at the application boundaries, such as an incoming http
request or in the process of parsing a file off disk", in which case:
> What is "every single piece of text"? Every string occurring in source
> code?
Yes.
> or also every single string that may be read from a file,
Yes.
> a
> socket,
Yes.
> out of a database,
Yes.
> or from a user interface?
Yes.
Any others I can say Yes to? ;-)
> How can you be certain that any string is UTF-8 when doing any
> reasonable IO?
Careful checking, and an understanding among the people working on the
app's development that anything else will result in severe pain, both
physical and mental ;-)
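
One way to make the "careful checking" mechanical is a strict decode at the point where bytes enter the application. This is only a sketch in Python 3 syntax, not code from the application under discussion; the helper name is_valid_utf8 is hypothetical.

```python
def is_valid_utf8(raw):
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        # errors="strict" raises instead of silently substituting characters.
        raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# The euro sign encoded as UTF-8 passes the check.
euro_ok = is_valid_utf8("\u20ac".encode("utf-8"))

# A lone Latin-1 byte for e-acute (0xE9) is not valid UTF-8 and is rejected.
latin1_ok = is_valid_utf8("\u00e9".encode("latin-1"))
```

A check like this only proves the bytes are well-formed UTF-8; it cannot prove that the sender actually meant UTF-8.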
>> Even if you were evil/stupid and mixed encodings, surely all you'd get
>> is different unicode errors or maybe the odd strange character during
>> display?
>
> One specific problem is dictionaries will stop working correctly if you
> set the default encoding to anything but ASCII.
...except they haven't.
> The reason is that
> with UTF-8 as the default encoding, you get
>
> py> u"\u20ac" == u"\u20ac".encode("utf-8")
> True
> py> hash(u"\u20ac") == hash(u"\u20ac".encode("utf-8"))
> False
>
> So objects that compare equal will not hash equal. As a consequence, you
> may have two different values for what should be the same key in a
> dictionary.
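
The consequence described in the quoted text can be reproduced without touching the default encoding. The sketch below uses Python 3 and a hypothetical class, LooseKey, whose equality treats UTF-8 bytes and text as the same value while its hash deliberately does not. It is an imitation of the mismatch, not the Python 2 str/unicode implementation.

```python
class LooseKey:
    """Hypothetical key: equal across bytes/text, but hashed by raw type."""

    def __init__(self, value):
        self.value = value

    def _as_text(self):
        # Treat bytes as UTF-8, mimicking an implicit default-encoding decode.
        if isinstance(self.value, bytes):
            return self.value.decode("utf-8")
        return self.value

    def __eq__(self, other):
        return isinstance(other, LooseKey) and self._as_text() == other._as_text()

    def __hash__(self):
        # Deliberately inconsistent with __eq__: bytes and text never share a hash.
        return 1 if isinstance(self.value, bytes) else 0


text_key = LooseKey("\u20ac")
bytes_key = LooseKey("\u20ac".encode("utf-8"))

keys_equal = text_key == bytes_key                # equal by comparison
hashes_equal = hash(text_key) == hash(bytes_key)  # but hashed differently

table = {text_key: "stored under text"}
table[bytes_key] = "stored under bytes"           # lands in a second slot
entry_count = len(table)
found_by_bytes = bytes_key in {text_key: "only text"}
```

Because a dict consults the hash before it tries equality, the two "equal" keys end up as separate entries, and a lookup with one form misses a value stored under the other.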
Indeed, but this doesn't happen because the app never has a situation
where strings and unicodes are put in the same dict. However, it does
have plenty of situations where lists containing a mixture of utf-8
encoded strings and unicodes exist, where changing the default encoding
removes a *lot* of pain.
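
For comparison, the explicit alternative to changing the default encoding is to normalise such a mixed list to text before using it. A minimal sketch in Python 3, assuming every byte string in the list really is UTF-8; the helper name to_text is hypothetical.

```python
def to_text(item, encoding="utf-8"):
    """Decode byte strings to text; pass text through unchanged."""
    if isinstance(item, bytes):
        return item.decode(encoding)
    return item


# A list mixing text with UTF-8 encoded byte strings.
mixed = ["caf\u00e9", "caf\u00e9".encode("utf-8"), "\u20ac".encode("utf-8")]

normalized = [to_text(item) for item in mixed]
joined = ", ".join(normalized)
```

This is the per-call-site work that a changed default encoding hides, which is where the pain mentioned above comes from.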
> It has worked in your application. See my example above: it is very easy
> to create applications that stop working correctly if you use
> setdefaultencoding (at all - the only supported value is "latin-1",
> since Unicode strings hash the same as byte strings if all characters
> are in row 0).
Would anyone object if I added this snippet to the .rst that generates:
http://docs.python.org/library/sys.html
It doesn't seem to be recorded anywhere that someone likely to use
setdefaultencoding would find it...
Chris
--
Simplistix - Content Management, Batch Processing & Python Consulting
- http://www.simplistix.co.uk