deleting setdefaultencoding in site.py is evil

Hi All, Would anyone object if I removed the deletion of sys.setdefaultencoding in site.py? I'm guessing "yes!" so thought I'd state my reasons now: This deletion appears to be pretty flimsy; reload(sys) and you have it back. Which is lucky, because I need it after it's been deleted... Why? Well, because you can no longer put sitecustomize.py in a project-specific location (http://bugs.python.org/issue1734860) and because for some projects the only way I can deal with encoded strings sensibly is to use setdefaultencoding, in my case at the start of a script generated by zc.buildout's zc.recipe.egg (I *know* all the encodings in this project are utf-8, but I don't want to go playing whack-a-mole with whatever modules this rather large project uses that haven't been made properly unicode aware). Yes, it needs to be used as early as possible, and the docs should say this, but deleting it seems to be petty in terms of stopping its use when sitecustomize.py is too early and too system-wide and spraying .decode('utf-8')'s all over a code base made up of a load of eggs managed by buildout simply isn't feasible... Thoughts? Chris -- Simplistix - Content Management, Batch Processing & Python Consulting - http://www.simplistix.co.uk

On 04:08 pm, chris@simplistix.co.uk wrote:
The ability to change the default encoding is a misfeature. There's essentially no way to write correct Python code in the presence of this feature. Using setdefaultencoding is never the sensible way to deal with encoded strings. Actually exposing this function in the sys module would lead all kinds of people who haven't fully grasped the way str, unicode, and encodings work to doing horrible things to create broken programs. It's bad enough that it's already possible to get this function back with the reload(sys) trick.
It may be a major task, but the best thing you can do is find each str and unicode operation in the software you're working with and make them correct with respect to your inputs and outputs. Flipping a giant switch for the entire process is just going to change which things are wrong. Jean-Paul

exarkun@twistedmatrix.com wrote:
How so? If every single piece of text in your project is encoded in a superset of ascii (such as utf-8), why would this be a problem? Even if you were evil/stupid and mixed encodings, surely all you'd get is different unicode errors or maybe the odd strange character during display?
Well, flipping that giant switch has worked in production for the past 5 years, so I'm afraid I'll respectfully disagree. I'd suspect the pragmatics of real-world software are why that function even exists, and it's extremely useful when used correctly... Chris -- Simplistix - Content Management, Batch Processing & Python Consulting - http://www.simplistix.co.uk

What is "every single piece of text"? Every string occurring in source code? or also every single string that may be read from a file, a socket, out of a database, or from a user interface? How can you be certain that any string is UTF-8 when doing any reasonable IO?
One specific problem is dictionaries will stop working correctly if you set the default encoding to anything but ASCII. The reason is that with UTF-8 as the default encoding, you get
py> u"\u20ac" == u"\u20ac".encode("utf-8")
True
py> hash(u"\u20ac") == hash(u"\u20ac".encode("utf-8"))
False
So objects that compare equal will not hash equal. As a consequence, you may have two different values for what should be the same key in a dictionary.
It has worked in your application. See my example above: it is very easy to create applications that stop working correctly if you use setdefaultencoding (at all - the only supported value is "latin-1", since Unicode strings hash the same as byte strings if all characters are in row 0). Regards, Martin
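The dictionary failure described above can be imitated on a current Python with a small model. This is a sketch under stated assumptions, not Python 2 itself: `Utf8Bytes` is a hypothetical class that mimics a byte string being compared with text under a UTF-8 default encoding, so equality decodes implicitly while the hash is still computed over the raw bytes.

```python
class Utf8Bytes:
    """Hypothetical stand-in for a Python 2 str under a UTF-8 default encoding."""

    def __init__(self, raw):
        self.raw = raw

    def __eq__(self, other):
        # Comparison with text decodes implicitly, as the default encoding would.
        if isinstance(other, str):
            return self.raw.decode("utf-8") == other
        if isinstance(other, Utf8Bytes):
            return self.raw == other.raw
        return NotImplemented

    def __hash__(self):
        # The hash is taken over the raw bytes, not over the decoded text.
        return hash(self.raw)


euro_text = "\u20ac"
euro_bytes = Utf8Bytes(euro_text.encode("utf-8"))

# The two keys compare equal, yet the dictionary stores them separately
# because their hashes differ.
table = {}
table[euro_text] = "stored via text key"
table[euro_bytes] = "stored via bytes key"
```

Because a dictionary consults the hash before it ever calls `__eq__`, the second assignment does not overwrite the first, which is the "two different values for the same key" outcome.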

Martin v. Löwis wrote:
I guess I should have said "every single piece of text in your project is encoded in a superset of ascii (such as utf-8) or is decoded into a unicode object at the application boundaries, such as an incoming http request or in the process of parsing a file off disk", in which case:
What is "every single piece of text"? Every string occurring in source code?
Yes.
or also every single string that may be read from a file,
Yes.
a socket,
Yes.
out of a database,
Yes.
or from a user interface?
Yes. Any others I can say Yes to? ;-)
How can you be certain that any string is UTF-8 when doing any reasonable IO?
Careful checking, and the knowledge among people working on the app's development that anything else will result in severe pain, both physical and mental ;-)
...except they haven't.
Indeed, but this doesn't happen because the app never has a situation where strings and unicodes are put in the same dict. However, it does have plenty of situations where lists containing a mixture of utf-8 encoded strings and unicodes exist, where changing the default encoding removes a *lot* of pain.
Would anyone object if I added this snippet to the .rst that generates: http://docs.python.org/library/sys.html It doesn't seem to be recorded anywhere anyone who's likely to use setdefaultencoding is likely to find it... Chris -- Simplistix - Content Management, Batch Processing & Python Consulting - http://www.simplistix.co.uk

In your application. Can you please agree that this is a semantic problem that is completely unacceptable for language design?
So you should convert all byte strings to UTF-8 before adding them to the list. Assuming you have used proper encapsulation and object-oriented design, it shouldn't be too difficult to find, for each such list, where the places are that modify the list.
Would anyone object if I added this snippet to the .rst that generates: http://docs.python.org/library/sys.html
The snippet explaining the problem? I don't mind, but Raymond is on record for objecting to any addition of a warning box to the documentation, because it gives the impression that Python is full of problems, when many of these warnings really refer to boundary cases only. Regards, Martin
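The advice above about normalising items before they enter a list can be sketched as follows. This is one reading of it, assuming the list should end up holding decoded text only; `to_text`, `add_name` and `names` are hypothetical names, and the sketch uses Python 3's `bytes`/`str` in place of Python 2's `str`/`unicode`.

```python
def to_text(value, encoding="utf-8"):
    """Return value as text, decoding byte strings with the given encoding."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


names = []


def add_name(value):
    # The single place that modifies the list, so normalisation happens once.
    names.append(to_text(value))


add_name("L\u00f6wis")                  # already text
add_name("L\u00f6wis".encode("utf-8"))  # UTF-8 encoded bytes
```

Funnelling every append through one function is what makes the conversion tractable: there is only one place to audit per list.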

Chris Withers writes:
If you're *that* careful, the additional effort to hack around this is negligible. The problem is that most people are *never* that careful, and *all* people are rarely that careful. I understand your use case, but I don't see a case for exposing this to the general public.

On 26 Aug, 11:51 pm, chris@simplistix.co.uk wrote:
This is what I meant when I said what I said about correct code. If you're happy to have encoding errors and corrupt data, then I guess you're happy to have a function like setdefaultencoding.
I suppose it's fortunate for you that the function exists, then. For my part, I have managed to write and operate a lot of code in production for at least as long without ever touching it. Generally speaking, I also don't find that I encounter lots of unicode errors or corrupted data (*sometimes* I do; in those cases, I fix the broken code and it doesn't happen again). Jean-Paul

On Aug 27, 2009, at 9:08 AM, exarkun@twistedmatrix.com wrote:
Whatever happened to "we're all adults here"[1]? I have no problem with making it difficult but possible to write buggy but practical code. Software engineering is a messy business. -Barry [1] That may not be literally true any more, but still :)

2009/8/27 Barry Warsaw <barry@python.org>:
Being adults about it also means knowing when to give up. Chris, please stop arguing about this. There are plenty of techniques you can use to get what you want without changing Python, for example virtualenv, which allows you to create a custom Python environment for each project. Or you could switch to Python 3.1, whose different approach to distinguishing between encoded and decoded strings means that you won't have to worry about the default encoding quite as much (and you are free to change the default *filesystem* encoding in Py3k). Or you could invoke python -S, which skips site.py and sitecustomize.py, so you are free to mess up any way you want. The fundamental reason the designers of Python's 2.x standard library don't want you to be able to set the default encoding in your app is that the standard library is written with the assumption that the default encoding is fixed, and no guarantees about the correct workings of the standard library can be made when you change it. There are no tests for this situation. Nobody knows what will fail when. And you (or worse, your users) *will* come back to us with complaints if the standard library suddenly starts doing things you didn't expect. -- --Guido van Rossum (home page: http://www.python.org/~guido/)
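The different approach in Python 3 mentioned above can be seen in a few lines. This is a minimal sketch run on Python 3, where text (`str`) and encoded data (`bytes`) are never converted into each other implicitly, so there is no default encoding to get wrong; the variable names are illustrative only.

```python
text = "caf\u00e9"
raw = text.encode("utf-8")  # encoding is always an explicit step

# Mixing the two types is an error rather than a silent implicit decode.
try:
    combined = text + raw
except TypeError:
    combined = None

# The caller has to name the encoding to get text back.
explicit = text + raw.decode("utf-8")
```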

Guido van Rossum wrote:
Being adults about it also means knowing when to give up. Chris, please stop arguing about this.
Sure. Even if people had agreed to this change, it wouldn't end up in a python release I could use for this project.
Yep, I'll resort to wrapping the buildout in a virtualenv iff the reload(sys) hack ends up causing problems...
Or you could switch to Python 3.1,
I would love to, once Python 3 has a viable web app story... cheers, Chris -- Simplistix - Content Management, Batch Processing & Python Consulting - http://www.simplistix.co.uk

Robert Brewer wrote:
My understanding was that the wsgi spec for Python 3 wasn't finished... Chris -- Simplistix - Content Management, Batch Processing & Python Consulting - http://www.simplistix.co.uk

Chris Withers wrote:
The WSGI 1.0 spec has always included Python 3 using unicode strings in the environ (decoded via ISO-8859-1, and limited to \x00-\xFF). In addition, the CherryPy and mod_wsgi teams are working to interoperably support a modified version of WSGI, in which the environ is true unicode for both Python 2 and 3. We hope these implementations become references from which a WSGI 1.1 spec can be written; since web-sig has not yet reached consensus on certain specification details, we are proceeding together with tools that allow users to get work done now. Robert Brewer fumanchu@aminus.org
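The ISO-8859-1 convention described above means each character in an environ string stands for exactly one original byte. A minimal sketch of what that implies for an application, assuming the client actually sent UTF-8: `environ_text_to_utf8` is a hypothetical helper, and the `environ` dictionary here is built by hand rather than by a real server.

```python
def environ_text_to_utf8(value):
    """Recover the original bytes from a latin-1 decoded string, then decode as UTF-8."""
    return value.encode("iso-8859-1").decode("utf-8")


# Simulate what a server does: decode the raw request bytes via ISO-8859-1.
raw_path = "/caf\u00e9".encode("utf-8")
environ = {"PATH_INFO": raw_path.decode("iso-8859-1")}

# The application reverses that step and applies the encoding it expects.
path = environ_text_to_utf8(environ["PATH_INFO"])
```

The round trip is lossless because ISO-8859-1 maps every byte value to the code point with the same number.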

On Mon, Aug 31, 2009 at 7:49 AM, Robert Brewer<fumanchu@aminus.org> wrote:
CherryPy 3.2 is now in beta, and mod_wsgi is nearly ready as well. Both support Python 3. :)
Excellent news! I just saw that PyYAML also supports 3.1. Slowly but surely, 3.1 is gaining traction... -- --Guido van Rossum (home page: http://www.python.org/~guido/)

Chris Withers wrote:
Let's look at this from another angle: sys.setdefaultencoding() is only made available for use in site.py. This is documented and by design (since a site may want to set the default encoding based on the locale or to "utf-8"). If you use it anywhere else, you're on your own. Such usage is not supported and may very well break your interpreter or cause data corruption (the default encoded versions of Unicode objects are cached inside the objects). Now, in your particular case, you're probably better off just tweaking site.py directly in your custom Python interpreter rather than relying on sitecustomize.py (see setencoding() in site.py). To answer your question: yes, this particular API may not be used outside site.py. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Aug 25 2009)

In retrospect, it should have been called sys._setdefaultencoding(). That sends an extra signal that it's not meant for general use. -- --Guido van Rossum (home page: http://www.python.org/~guido/)

On 2009-08-25 12:37 PM, Guido van Rossum wrote:
In retrospect, it should have been called sys._setdefaultencoding(). That sends an extra signal that it's not meant for general use.
Considering all of the sys._getframe() hacks out there, I suspect that this would encourage more abuse of the function than the current situation. -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco

On Tue, Aug 25, 2009 at 11:10 AM, Robert Kern<robert.kern@gmail.com> wrote:
Why? It would still be deleted by site.py. The abuse of sys._getframe() exists because it fills a real need. (As does abuse of sys.setdefaultencoding(). However abusing it is actually more troublesome, because the problems are much less theoretical.) -- --Guido van Rossum (home page: http://www.python.org/~guido/)

On 2009-08-25 13:29 PM, Guido van Rossum wrote:
Ah, yes. You're right. For whatever reason I thought it lived as site.setdefaultencoding() when I read your message and thought that you were proposing to move it to sys. Never mind me. -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco

Guido van Rossum wrote:
In retrospect, it should have been called sys._setdefaultencoding(). That sends an extra signal that it's not meant for general use.
Crazy idea: how about mutating it into sys._setdefaultencoding rather than deleting it? Chris -- Simplistix - Content Management, Batch Processing & Python Consulting - http://www.simplistix.co.uk

Martin v. Löwis wrote:
How is it breaking backwards compatibility? - If people were somehow relying on sys.setdefaultencoding to be deleted, that's fine, it's still gone - If people were somehow relying on sys not having an attribute called _setdefaultencoding, or were relying on stuffing an attribute into sys called _setdefaultencoding then... well... that seems pretty unlikely ;-) Chris -- Simplistix - Content Management, Batch Processing & Python Consulting - http://www.simplistix.co.uk

Martin v. Löwis wrote:
No it doesn't:
$ svn diff
Index: Lib/site.py
===================================================================
--- Lib/site.py (revision 74552)
+++ Lib/site.py (working copy)
@@ -540,6 +540,7 @@
 if hasattr(sys, "setdefaultencoding"):
+    sys._setdefaultencoding = sys.setdefaultencoding
     del sys.setdefaultencoding
Chris -- Simplistix - Content Management, Batch Processing & Python Consulting - http://www.simplistix.co.uk

Ah, so you didn't want to rename the function. I agree that this would not break backwards compatibility. I guess the basic objection remains: making it so would make _setdefaultencoding a supported feature, which would then mean that we should fix all the bugs that it causes - when we already know (because we thought many years about this) that it is not possible to implement setdefaultencoding correctly and efficiently (so the current implementation is only efficient, but not correct). Regards, Martin

M.-A. Lemburg wrote:
Let's look at this from another angle: sys.setdefaultencoding() is only made available for use in site.py.
...see this: http://mail.python.org/pipermail/python-dev/2009-August/091391.html I would like to use sitecustomize.py for all the very good reasons given in this thread: - I don't want to change the default encoding for every project that uses the python installation in question - I don't even want to change the default encoding for every python script run by the current user - I only want to change the default encoding for one particular project. Sadly, for the reasons I describe in the thread, site.py won't find a sitecustomize.py in this situation...
If you use it anywhere else, you're on your own.
No problem with that. To be specific, this is a Zope 2.12 instance driven by this buildout:
[instance]
recipe = zc.recipe.egg
eggs = ${buildout:eggs}
interpreter = py
entry-points=
    runzope=Zope2.Startup.run:run
    zopectl=Zope2.Startup.zopectl:main
scripts = runzope zopectl
initialization =
    import sys
    reload(sys)
    sys.setdefaultencoding('utf-8')
    sys.argv[1:1] = ['-C','${buildout:directory}/etc/instance.conf']
The call to sys.setdefaultencoding is *very* early in the scheme of things... The runzope script that gets run only has some sys.path manipulation before sys.setdefaultencoding gets called. What problems could there be by calling sys.setdefaultencoding there?
Such usage is not supported and may very well break your interpreter
Can you give an example?
When called as early as in the above script, what objects would have encoded strings cached in them?
Why? Chris -- Simplistix - Content Management, Batch Processing & Python Consulting - http://www.simplistix.co.uk

Chris Withers wrote:
You can get strange effects caused by the fact that some string objects will now compare equal while not necessarily having the same hash value. Unicode objects and strings have the same hash value provided that they are both ASCII. With the ASCII default encoding, a non-ASCII string cannot be compared to a Unicode object, so the problem does not occur.
Difficult to say. This depends a lot on the environment where you are running the script. Note that the codecs are loaded at a very early stage in the interpreter startup and a lot of them do use Unicode strings. This wasn't the case in Python 1.6 when the whole site.py approach to setting the default encoding was designed, but was added later on, in Python 2.1 IIRC, when no one really considered using a different default encoding anymore. Using UTF-8 as the new default encoding will not cause much trouble with this, since it is an ASCII superset. However, changing it more than once will cause the earlier Unicode objects to still use the old default encoding value. Using a different non-ASCII compatible encoding, such as UTF-16, will cause breakage for the same reason. The default encoded string version of a Unicode object is cached in the object and never recreated after it has first been successfully encoded. When only changing the default encoding once and using UTF-8 as the new default encoding, you'll only run into the hash value problem. If that's not an issue for your application, e.g. you don't mix Unicode and string key objects in your dictionaries and don't rely on the special relationship between hashes and comparisons elsewhere, you should be fine.
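The caching effect described above can be illustrated with a toy model. This is not how CPython is implemented; `settings`, `CachingText` and `set_default_encoding` are hypothetical names, chosen only to show why an object that was encoded once keeps its old bytes after the process-wide setting changes.

```python
# Stand-in for the interpreter-wide default encoding.
settings = {"default_encoding": "ascii"}


def set_default_encoding(name):
    settings["default_encoding"] = name


class CachingText:
    """Toy model of a text object that caches its default-encoded form."""

    def __init__(self, value):
        self.value = value
        self._cached = None

    def default_encoded(self):
        # Encoded once on first use and never recreated afterwards.
        if self._cached is None:
            self._cached = self.value.encode(settings["default_encoding"])
        return self._cached


set_default_encoding("utf-8")
early = CachingText("\u20ac")
early.default_encoded()            # cached while the default is UTF-8

set_default_encoding("utf-16-le")  # second change of the default
late = CachingText("\u20ac")

stale = early.default_encoded()    # still the UTF-8 bytes
fresh = late.default_encoded()     # encoded with the new default
```

The two objects hold the same text but now disagree about their encoded form, which is the breakage that changing the default more than once invites.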
To get the job done :-) You could rewrite setencoding() to get the encoding information from e.g. an os.environ variable or some config file. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Aug 27 2009)
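The suggestion of driving the choice from the environment could look roughly like this. It is a sketch under assumptions: `PROJECT_DEFAULT_ENCODING` is a made-up variable name and `choose_default_encoding` a hypothetical helper. On Python 2 the result would be handed to sys.setdefaultencoding() inside a rewritten setencoding() in site.py; the sketch stops short of that call because the function does not exist on Python 3.

```python
import codecs
import os


def choose_default_encoding(environ=os.environ, fallback="ascii"):
    """Pick an encoding name from the environment, falling back if it is unknown."""
    name = environ.get("PROJECT_DEFAULT_ENCODING", fallback)
    try:
        # codecs.lookup validates the name and returns its canonical spelling.
        return codecs.lookup(name).name
    except LookupError:
        return fallback


# Per-project use: only a process started with the variable set is affected.
chosen = choose_default_encoding({"PROJECT_DEFAULT_ENCODING": "UTF8"})
```

Falling back to ASCII on an unknown name keeps a typo in the environment from silently selecting some other codec.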

On 04:08 pm, chris@simplistix.co.uk wrote:
The ability to change the default encoding is a misfeature. There's essentially no way to write correct Python code in the presence of this feature. Using setdefaultencoding is never the sensible way to deal with encoded strings. Actually exposing this function in the sys module would lead all kinds of people who haven't fully grasped the way str, unicode, and encodings work to doing horrible things to create broken programs. It's bad enough that it's already possible to get this function back with the reload(sys) trick.
It may be a major task, but the best thing you can do is find each str and unicode operation in the software you're working with and make them correct with respect to your inputs and outputs. Flipping a giant switch for the entire process is just going to change which things are wrong. Jean-Paul

exarkun@twistedmatrix.com wrote:
How so? If every single piece of text in your project is encoded in a superset of ascii (such as utf-8), why would this be a problem? Even if you were evil/stupid and mixed encodings, surely all you'd get is different unicode errors or mayvbe the odd strange character during display?
Well, flipping that giant switch has worked in production for the past 5 years, so I'm afraid I'll respectfully disagree. I'd suspect the pragmatics of real world software are with that function even exists, and it's extremely useful when used correctly... Chris -- Simplistix - Content Management, Batch Processing & Python Consulting - http://www.simplistix.co.uk

What is "every single piece of text"? Every string occurring in source code? or also every single string that may be read from a file, a socket, out of a database, or from a user interface? How can you be certain that any string is UTF-8 when doing any reasonable IO?
One specific problem is dictionaries will stop working correctly if you set the default encoding to anything but ASCII. The reason is that with UTF-8 as the default encoding, you get py> u"\u20ac" == u"\u20ac".encode("utf-8") True py> hash(u"\u20ac") == hash(u"\u20ac".encode("utf-8")) False So objects that compare equal will not hash equal. As a consequence, you may have two different values for what should be the same key in a dictionary.
It has worked in your application. See my example above: it is very easy to create applications that stop working correctly if you use setdefaultencoding (at all - the only supported value is "latin-1", since Unicode strings hash the same as byte strings if all characters are in row 0). Regards, Martin

Martin v. Löwis wrote:
I guess I should have said "every single piece of text in your project is encoded in a superset of ascii (such as utf-8) or is decoded into a unicode object at the application boundaries, such as an incoming http request or in the process of parsing a file off disk", in which case:
What is "every single piece of text"? Every string occurring in source code?
Yes.
or also every single string that may be read from a file,
Yes.
a socket,
Yes.
out of a database,
Yes.
or from a user interface?
Yes. Any others I can say Yes to? ;-)
How can you be certain that any string is UTF-8 when doing any reasonable IO?
Careful checking, and a knowledge for people working on the app's development that anything else will result in severe pain, both physical and mental ;-)
...except they haven't.
Indeed, but this doesn't happen because the app never has a situation where strings and unicodes are put in the same dict. However, it does have plenty of situations where lists containing a mixture of utf-8 encoded strings and unicodes exist, where changing the default encoding removes a *lot* of pain.
Would anyone object if I added this snippet to the .rst that generates: http://docs.python.org/library/sys.html It doesn't seem to be recorded anywhere anyone who's likely to use setdefaultencoding is likely to find it... Chris -- Simplistix - Content Management, Batch Processing & Python Consulting - http://www.simplistix.co.uk

In your application. Can you please agree that this a semantical problem that is completely unacceptable for language design?
So you should convert all byte strings to UTF-8 before adding them to the list. Assuming you have used proper encapsulation and object-oriented design, it shouldn't be too difficult to find, for each such list, where the places are that modify the list.
Would anyone object if I added this snippet to the .rst that generates: http://docs.python.org/library/sys.html
The snippet explaining the problem? I don't mind, but Raymond is on record for objecting to any addition of a warning box to the documentation, because it gives the impression that Python is full of problems, when many these warnings really refer to boundary cases only. Regards, Martin

Chris Withers writes:
If you're *that* careful, the additional effort to hack around this is negligible. The problem is that most people are *never* that careful, and *all* people are rarely that careful. I understand your use case, but I don't see a case for exposing this to the general public.

On 26 Aug, 11:51 pm, chris@simplistix.co.uk wrote:
This is what I meant when I said what I said about correct code. If you're happy to have encoding errors and corrupt data, then I guess you're happy to have a function like setdefaultencoding.
I suppose it's fortunate for you that the function exists, then. For my part, I have managed to write and operate a lot of code in production for at least as long without ever touching it. Generally speaking, I also don't find that I encounter lots of unicode errors or corrupted data (*sometimes* I do; in those cases, I fix the broken code and it doesn't happen again). Jean-Paul

On Aug 27, 2009, at 9:08 AM, exarkun@twistedmatrix.com wrote:
Whatever happened to "we're all adults here"[1]? I have no problem with making it difficult but possible to write buggy but practical code. Software engineering is a messy business. -Barry [1] That may not be literally true any more, but still :)

2009/8/27 Barry Warsaw <barry@python.org>:
Being adults about it also means when to give up. Chris, please stop arguing about this. There are plenty of techniques you can use to get what you want without changing Python, for example virtualenv, which allows you to create a custom Python environment for each project. Or you could switch to Python 3.1, whose different approach to distinguishing between encoded and decoded string means that you won't have to worry about the default encoding quite as much (and you are free to change the default *filesystem* encoding in Py3k). Or you could invoke python -S, which skips site.py and sitecustomize.py, so you are free to mess up any way you want. The fundamental reason the designers of Python's 2.x standard library don't want you to be able to set the default encoding in your app, is that the standard library is written with the assumption that the default encoding is fixed, and no guarantees about the correct workings of the standard library can be made when you change it. There are no tests for this situation. Nobody knows what will fail when. And you (or worse, your users) *will* come back to us with complaints if the standard library suddenly starts doing things you didn't expect. -- --Guido van Rossum (home page: http://www.python.org/~guido/)

Guido van Rossum wrote:
Being adults about it also means when to give up. Chris, please stop arguing about this.
Sure. Even if people had agreed to this change, it wouldn't end up in a python release I could use for this project.
Yep, I'll resort to wrapping the buildout in a virtualenv iff the reload(sys) hack ends up causing problems...
Or you could switch to Python 3.1,
I would love to, once Python 3 has a viable web app story... cheers, Chris -- Simplistix - Content Management, Batch Processing & Python Consulting - http://www.simplistix.co.uk

Robert Brewer wrote:
My understanding was that the wsgi spec for Python 3 wasn't finished... Chris -- Simplistix - Content Management, Batch Processing & Python Consulting - http://www.simplistix.co.uk

Chris Withers wrote:
The WSGI 1.0 spec has always included Python 3 using unicode strings in the environ (decoded via ISO-8859-1, and limited to \x00-\xFF). In addition, the CherryPy and mod_wsgi teams are working to interoperably support a modified version of WSGI, in which the environ is true unicode for both Python 2 and 3. We hope these implementations become references from which a WSGI 1.1 spec can be written; since web-sig has not yet reached consensus on certain specification details, we are proceeding together with tools that allow users to get work done now. Robert Brewer fumanchu@aminus.org

On Mon, Aug 31, 2009 at 7:49 AM, Robert Brewer<fumanchu@aminus.org> wrote:
CherryPy 3.2 is now in beta, and mod_wsgi is nearly ready as well. Both support Python 3. :)
Excellent news! I just saw that PyYAML also suports 3.1. Slowly but surely, 3.1 is gaining traction... -- --Guido van Rossum (home page: http://www.python.org/~guido/)

Chris Withers wrote:
Let's look at this from another angle: sys.setdefaultencoding() is only made available for use in site.py. This is documented and by design (since a site may want to set the default encoding based on the locale or to "utf-8"). If you use it anywhere else, you're on your own. Such usage is not supported and may very well break your interpreter or cause data corruption (the default encoded versions of Unicode objects are cached inside the objects). Now, in your particular case, you're probably better off just tweaking site.py directly in your custom Python interpreter rather than relying on sitecustomize.py (see setencoding() in site.py). To answer your question: yes, this particular API may not be used outside site.py. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Aug 25 2009)
::: Try our new mxODBC.Connect Python Database Interface for free ! :::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/

In retrospect, it should have been called sys._setdefaultencoding(). That sends an extra signal that it's not meant for general use. -- --Guido van Rossum (home page: http://www.python.org/~guido/)

On 2009-08-25 12:37 PM, Guido van Rossum wrote:
In retrospect, it should have been called sys._setdefaultencoding(). That sends an extra signal that it's not meant for general use.
Considering all of the sys._getframe() hacks out there, I suspect that this would encourage more abuse of the function than the current situation. -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco

On Tue, Aug 25, 2009 at 11:10 AM, Robert Kern<robert.kern@gmail.com> wrote:
Why? It would still be deleted by site.py. The abuse of sys._getframe() exists because it fills a real need. (As does abuse of sys.setdefaultencoding(). However abusing it is actually more troublesome, because the problems are much less theoretical.) -- --Guido van Rossum (home page: http://www.python.org/~guido/)

On 2009-08-25 13:29 PM, Guido van Rossum wrote:
Ah, yes. You're right. For whatever reason I thought it lived as site.setdefaultencoding() when I read your message and thought that you were proposing to move it to sys. Never mind me. -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco

Guido van Rossum wrote:
In retrospect, it should have been called sys._setdefaultencoding(). That sends an extra signal that it's not meant for general use.
Crazy idea: how about mutating it into sys._setdefaultencoding rather than deleting it? Chris -- Simplistix - Content Management, Batch Processing & Python Consulting - http://www.simplistix.co.uk

Martin v. Löwis wrote:
How is it breaking backwards compatibility? - If people were somehow relying on sys.setdefaultencoding to be deleted, that's fine, it's still gone - If people were somehow relying on sys not having an attribute called _setdefaultencoding, or were relying on stuffing an attribute into sys called _setdefaultencoding then... well... that seems pretty unlikely ;-) Chris -- Simplistix - Content Management, Batch Processing & Python Consulting - http://www.simplistix.co.uk

Martin v. Löwis wrote:
No it doesn't: $ svn diff Index: Lib/site.py =================================================================== --- Lib/site.py (revision 74552) +++ Lib/site.py (working copy) @@ -540,6 +540,7 @@ if hasattr(sys, "setdefaultencoding"): + sys._setdefaultencoding = sys.setdefaultencoding del sys.setdefaultencoding
Chris -- Simplistix - Content Management, Batch Processing & Python Consulting - http://www.simplistix.co.uk

Ah, so you didn't want to rename the function. I agree that this would not break backwards compatibility. I guess the basic objection remains: making it so would make _setdefaultencoding a supported feature, which would then mean that we should fix all the bugs that it causes - when we already know (because we thought many years about this) that it is not possible to implement setdefaultencoding correctly and efficiently (so the current implementation is only efficient, but not correct). Regards, Martin

M.-A. Lemburg wrote:
Let's look at this from another angle: sys.setdefaultencoding() is only made available for use in site.py.
...see this: http://mail.python.org/pipermail/python-dev/2009-August/091391.html

I would like to use sitecustomize.py for all the very good reasons given in this thread:

- I don't want to change the default encoding for every project that uses the python installation in question
- I don't even want to change the default encoding for every python script run by the current user
- I only want to change the default encoding for one particular project.

Sadly, for the reasons I describe in the thread, site.py won't find a sitecustomize.py in this situation...
If you use it anywhere else, you're on your own.
No problem with that. To be specific, this is a Zope 2.12 instance driven by this buildout:

[instance]
recipe = zc.recipe.egg
eggs = ${buildout:eggs}
interpreter = py
entry-points =
    runzope=Zope2.Startup.run:run
    zopectl=Zope2.Startup.zopectl:main
scripts = runzope zopectl
initialization =
    import sys
    reload(sys)
    sys.setdefaultencoding('utf-8')
    sys.argv[1:1] = ['-C','${buildout:directory}/etc/instance.conf']

The call to sys.setdefaultencoding is *very* early in the scheme of things... The runzope script that gets run only has some sys.path manipulation before sys.setdefaultencoding gets called. What problems could there be by calling sys.setdefaultencoding there?
Such usage is not supported and may very well break your interpreter
Can you give an example?
When called as early as in the above script, what objects would have encoded strings cached in them?
Why? Chris -- Simplistix - Content Management, Batch Processing & Python Consulting - http://www.simplistix.co.uk

Chris Withers wrote:
You can get strange effects caused by the fact that some string objects will now compare equal while not necessarily having the same hash value. Unicode objects and strings have the same hash value provided that they are both ASCII. With the ASCII default encoding, a non-ASCII string cannot be compared to a Unicode object, so the problem does not occur.
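The effect described above can be modelled in a few lines. This is only a toy sketch, not the interpreter's actual behaviour: Utf8Str is a hypothetical stand-in class for a Python 2 byte string under a UTF-8 default encoding, written so that it also runs on current Python. It compares equal to the decoded text but keeps hashing its raw bytes, which is the mismatch in question.

```python
class Utf8Str:
    """Hypothetical stand-in for a Python 2 byte string once the default encoding is UTF-8."""

    def __init__(self, raw):
        self.raw = raw  # the undecoded bytes

    def __eq__(self, other):
        # Comparison with text decodes using the (changed) default encoding.
        if isinstance(other, str):
            return self.raw.decode("utf-8") == other
        if isinstance(other, Utf8Str):
            return self.raw == other.raw
        return NotImplemented

    def __hash__(self):
        # The hash is still derived from the raw bytes, not the decoded text.
        return hash(self.raw)


key = Utf8Str("café".encode("utf-8"))
text = "café"
table = {key: "stored"}

equal = key == text    # the two objects compare equal...
found = text in table  # ...but the lookup goes by hash first and misses
```

With equal keys hashing differently, a dictionary that mixes both kinds of key can silently fail to find an entry, which is the hazard to check for before relying on a changed default.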
Difficult to say. This depends a lot on the environment where you are running the script.

Note that the codecs are loaded at a very early stage in the interpreter startup and a lot of them do use Unicode strings. This wasn't the case in Python 1.6, when the whole site.py approach to setting the default encoding was designed; it was added later on, in Python 2.1 IIRC, when no one really considered using a different default encoding anymore.

Using UTF-8 as the new default encoding will not cause much trouble with this, since it is an ASCII superset. However, changing it more than once will cause the earlier Unicode objects to still use the old default encoding value. Using a different non-ASCII compatible encoding, such as UTF-16, will cause breakage for the same reason: the default encoded string version of a Unicode object is cached in the object and never recreated after it has first been successfully encoded.

When only changing the default encoding once and using UTF-8 as the new default encoding, you'll only run into the hash value problem. If that's not an issue for your application, e.g. you don't mix Unicode and string key objects in your dictionaries and don't rely on the special relationship between hashes and comparisons elsewhere, you should be fine.
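The caching point can be illustrated with a toy model. Everything here is an assumption made for illustration: CachingText and DEFAULT_ENCODING are hypothetical names standing in for a Python 2 unicode object and the interpreter-wide default, and the real cache lives inside the C implementation.

```python
DEFAULT_ENCODING = "ascii"  # hypothetical stand-in for the interpreter-wide default


class CachingText:
    """Toy model of a unicode object that caches its default-encoded form."""

    def __init__(self, text):
        self.text = text
        self._cached = None

    def default_encoded(self):
        # Encode once with whatever the default is at that moment, then reuse.
        if self._cached is None:
            self._cached = self.text.encode(DEFAULT_ENCODING)
        return self._cached


early = CachingText("abc")
early.default_encoded()          # cached while the default is still ASCII

DEFAULT_ENCODING = "utf-16"      # the default is changed later on
late = CachingText("abc")

stale = early.default_encoded()  # still the ASCII bytes from before the change
fresh = late.default_encoded()   # encoded with the new default
```

Objects created before the switch keep answering with their old encoded form, so two equal texts can yield different byte strings depending on when they were first encoded.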
To get the job done :-) You could rewrite setencoding() to get the encoding information from e.g. an os.environ variable or some config file. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Aug 27 2009)
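A rough sketch of the environment-variable idea might look like the following. It is not the setencoding() that ships in site.py; the variable name PY_DEFAULT_ENCODING is made up for this example and is not read by any real Python, and the function only calls sys.setdefaultencoding when the attribute is still present (Python 2, before site.py deletes it).

```python
import codecs
import os
import sys

ENV_VAR = "PY_DEFAULT_ENCODING"  # hypothetical name, chosen for this sketch only


def setencoding(environ=os.environ, fallback="ascii"):
    """Choose the default encoding from the environment and apply it if possible."""
    name = environ.get(ENV_VAR, fallback)
    try:
        # Normalise the spelling and reject codecs the interpreter does not know.
        name = codecs.lookup(name).name
    except LookupError:
        name = fallback
    # sys.setdefaultencoding only exists on Python 2, and only until site.py removes it.
    setter = getattr(sys, "setdefaultencoding", None)
    if setter is not None:
        setter(name)
    return name
```

A per-project launcher could then export the variable before starting the interpreter, which keeps the choice out of the system-wide installation.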
participants (9)
- "Martin v. Löwis"
- Barry Warsaw
- Chris Withers
- exarkun@twistedmatrix.com
- Guido van Rossum
- M.-A. Lemburg
- Robert Brewer
- Robert Kern
- Stephen J. Turnbull