[Python-Dev] global encoding?!? (was: [Python-checkins] ... unicodeobject.c)

M.-A. Lemburg mal@lemburg.com
Tue, 09 May 2000 23:35:16 +0200

Guido van Rossum wrote:
> > Umm... maybe I missed something, but I thought there was pretty broad
> > feelings *against* having a global like this. This kind of thing is just
> > nasty.
> >
> > 1) Python modules can't change it, nor can they rely on it being a
> >    particular value
> > 2) a mutable, global variable is just plain wrong. The InterpreterState
> >    and ThreadState structures were created *specifically* to avoid adding
> >    crap variables like this.
> > 3) allowing a default other than utf-8 is sure to cause gotchas and
> >    surprises. Some code is going to rightly assume that the default is
> >    just that, but be horribly broken when an application changes it.

Hmm, the patch notice says it all I guess:

This patch fixes a few bugglets and adds an experimental
feature which allows setting the string encoding assumed
by the Unicode implementation at run-time.

The current implementation uses a process global for
the string encoding. This should subsequently be changed
to a thread state variable, so that the setting can
be done on a per thread basis.

Note that only the coercions from strings to Unicode
are affected by the encoding parameter. The "s" parser
marker still returns UTF-8. (str(unicode) also returns
the string encoding -- unlike what I wrote in the original
patch notice.)

The main intent of this patch is to provide a test
bed for the ongoing Unicode debate, e.g. to have the
implementation use 'latin-1' as default string encoding, put

import sys
sys.set_string_encoding('latin-1')

in your site.py file.

> > Somebody please say this is hugely experimental. And then say why it isn't
> > just a private patch, rather than sitting in CVS.
>
> Watch your language.
> Marc did this at my request.  It is my intention that the encoding be
> hardcoded at compile time.  But while there's a discussion going about
> what the hardcoded encoding should *be*, it would seem handy to have a
> quick way to experiment.

Right, and that was the intent behind adding a global
and some APIs to change it first... there are a few ways this
could eventually be finalized:

1. hardcode the encoding (UTF-8 was previously hard-coded)
2. make the encoding a compile time option
3. make the encoding a per-process option
4. make the encoding a per-thread option
5. make the encoding a per-process setting which is deduced
   from env. vars such as LC_ALL, LC_CTYPE, LANG or system
   APIs which can be used to get at the currently
   active local encoding
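
To make option 5 concrete, here is a minimal sketch of deducing an
encoding from the environment. It is written against today's Python 3
and is not part of the patch; the function name guess_locale_encoding,
the "ascii" fallback and the simple "language_TERRITORY.codeset@modifier"
parsing are all assumptions made for illustration.

```python
import os


def guess_locale_encoding(environ=None, default="ascii"):
    """Guess an encoding name from LC_ALL, LC_CTYPE or LANG.

    Hypothetical helper: the first variable that is set and non-empty
    decides the result, following the usual POSIX precedence.
    """
    env = os.environ if environ is None else environ
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name)
        if not value:
            continue
        # Drop an optional modifier such as "@euro".
        value = value.split("@", 1)[0]
        # The codeset, if present, follows the first dot.
        if "." in value:
            return value.split(".", 1)[1].lower()
        # Locale names without a codeset (e.g. "C") fall back to the default.
        return default
    return default
```

Passing a plain dict instead of os.environ keeps the helper easy to
try out, e.g. guess_locale_encoding({"LANG": "de_DE.ISO-8859-1"}).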

Note that I have named the APIs sys.get/set_string_encoding()...
I've done that on purpose, because I have a feeling that
changing the conversion from Unicode to strings from UTF-8
to an encoding not capable of representing all Unicode
characters won't get us very far. Also, changing this is
rather tricky due to the way the buffer API works.
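
The concern about the Unicode-to-string direction can be illustrated
with a small sketch. It uses today's Python 3 str.encode() rather than
the experimental API from the patch, and the sample text is made up:
UTF-8 can represent every Unicode character, while an encoding such as
latin-1 cannot.

```python
# Sample text containing the euro sign (U+20AC), which latin-1 lacks.
text = "price: 10 \u20ac"

# UTF-8 can encode any Unicode string and round-trips it unchanged.
utf8_bytes = text.encode("utf-8")

# latin-1 only covers U+0000..U+00FF, so encoding this text fails.
try:
    text.encode("latin-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False
```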

The other way around needs some experimenting though and this
is what the patch implements: it allows you to change the
string encoding assumption to test various
possibilities, e.g. ascii, latin-1, unicode-escape,
<your favourite local encoding> etc. without having to
recompile the interpreter every time.
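
As a rough illustration of why the string-to-Unicode assumption
matters, the following sketch decodes the same bytes under several
candidate encodings. It uses today's Python 3 bytes.decode() and not
sys.set_string_encoding(); the helper try_decode and the sample bytes
are assumptions made for this example.

```python
# The bytes for "caf" followed by 0xE9 (e-acute in latin-1).
raw = b"caf\xe9"


def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes are invalid."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# Same input, three different assumptions about the string encoding.
results = {enc: try_decode(raw, enc) for enc in ("ascii", "latin-1", "utf-8")}
```

Only the latin-1 assumption yields text here; ascii rejects the
non-ASCII byte and utf-8 rejects the truncated multi-byte sequence.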

Have fun with it :-)

Marc-Andre Lemburg
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/