[Python-Dev] String encoding

M.-A. Lemburg mal@lemburg.com
Tue, 23 May 2000 12:10:20 +0200


The recent discussion about repr() et al. brought up the idea
of a locale-based string encoding again.

A support module for querying the encoding used in the current
locale, together with the experimental hook to set the string
encoding, could yield a compromise which satisfies ASCII, Latin-1
and UTF-8 proponents.

The idea is to use the site.py module to customize the interpreter
from within Python (rather than making the encoding a compile-time
option). This is easily doable using the (yet to be written)
support module and the sys.setstringencoding() hook.

The default encoding would be 'ascii' and could then be changed
to whatever the user or administrator wants it to be on a
per-site basis. Furthermore, the encoding should be settable on
a per-thread basis inside the interpreter (Python threads
do not seem to inherit any per-thread globals, so the
encoding would have to be set for each new thread).
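
To illustrate the per-thread point, here is a minimal pure-Python
sketch. It is not the interpreter hook itself: set_string_encoding(),
get_string_encoding() and DEFAULT_ENCODING are hypothetical stand-ins
built on threading.local, used only to show that a value set in the
main thread is not seen by a newly started thread.

```python
import threading

# Hypothetical stand-in for the interpreter's per-thread encoding slot.
_state = threading.local()
DEFAULT_ENCODING = 'ascii'


def set_string_encoding(name):
    """Set the string encoding for the calling thread only."""
    _state.encoding = name


def get_string_encoding():
    """Return the calling thread's encoding, or the default if unset."""
    return getattr(_state, 'encoding', DEFAULT_ENCODING)


# The main thread picks Latin-1 ...
set_string_encoding('latin-1')

# ... but a new thread does not inherit that setting.
seen = []


def worker():
    seen.append(get_string_encoding())


t = threading.Thread(target=worker)
t.start()
t.join()
```

After this runs, the main thread still reports 'latin-1' while the
worker recorded the 'ascii' default, so any non-default encoding would
have to be set again at the start of each thread.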

E.g. a site.py module could look like this:

"""
import locale, sys

# Get encoding, defaulting to 'ascii' in case it cannot be
# determined (get_encoding() is the yet-to-be-written support API)
defenc = locale.get_encoding('ascii')

# Set main thread's string encoding
sys.setstringencoding(defenc)

This would make the Unicode implementation assume defenc as the
encoding of strings.
"""
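
As a rough idea of what the support function could look like, here is
a sketch of get_encoding() as a standalone function. The name and
signature follow the example above and are assumptions, since the
module is not written yet. It relies on locale.getpreferredencoding()
and codecs.lookup() from the standard library and falls back to the
given default whenever the locale's encoding is unknown or has no
codec.

```python
import codecs
import locale


def get_encoding(default='ascii'):
    """Return the current locale's encoding name, or `default`.

    Hypothetical helper: the fallback is used when the locale does not
    report an encoding or Python has no codec registered for it.
    """
    try:
        # False: query the locale without calling setlocale().
        name = locale.getpreferredencoding(False)
    except Exception:
        return default
    if not name:
        return default
    try:
        # Check that a codec exists and return its normalized name.
        return codecs.lookup(name).name
    except LookupError:
        return default


defenc = get_encoding('ascii')
```

The result depends on the machine's locale settings, so the only
guarantee is that defenc names a codec Python can actually load.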

Minor nit: due to the implementation, the C parser markers
"s" and "t" and the hash() value calculation will still need
to work with a fixed encoding, which remains UTF-8. C APIs
which want to support Unicode should be fixed to use "es"
or query the object directly and then apply the proper, possibly
OS-dependent, conversion.

Before I start implementing the above, I'd like to
hear some comments...

Thanks,
-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/