[I18n-sig] New Unicode default encoding scheme

M.-A. Lemburg mal@lemburg.com
Fri, 09 Jun 2000 13:09:19 +0200


Hi everybody,

I just wanted to inform you that the Unicode default encoding
handling has changed from the strict UTF-8 setting to a
much more flexible solution which is based on the default
locale settings (provided via the LANG environment variable).
The new default setting is ASCII as per Guido's request.

Here's the relevant section of the Misc/unicode.txt file.
For more details, please read that file in the current
CVS tree.

"""
Unicode Default Encoding:
-------------------------

The Unicode implementation has to make some assumption about the
encoding of 8-bit strings passed to it for coercion and about the
encoding to use as default for conversion of Unicode to strings when
no specific encoding is given. This encoding is called <default
encoding> throughout this text.

If not otherwise defined or set, the <default encoding> is set to
'ascii'.

For this, the implementation maintains a global which can be set in
the site.py Python startup script. Subsequent changes are not
possible. The <default encoding> can be set and queried using the
two sys module APIs:

  sys.setdefaultencoding(encoding)
     --> Sets the <default encoding> used by the Unicode implementation.
         encoding has to be an encoding which is supported by the Python
         installation; otherwise a LookupError is raised. Note: This API
         is only available in site.py!

  sys.getdefaultencoding()
     --> Returns the current default encoding.

To enhance usability of Unicode coercion, the <default encoding> is
set in the default site.py startup module according to the encoding
defined by the locale active when the site.py module gets executed.
The locale module is used to extract the encoding from the locale
default settings defined in the LANG environment variable (and
possibly others -- see locale.py). If the encoding cannot be
determined, is unknown or unsupported, site.py defaults to setting the
<default encoding> to 'ascii'. This encoding is also the startup
default of Python (and in effect before site.py is executed).
"""
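
The fallback rule described above can be sketched roughly as follows.
This is only an illustration and not the actual site.py code: the
helper name guess_default_encoding() is hypothetical, and the parsing
of the LANG value (accepting both "." and ":" as separators) is an
assumption made to match the examples in this message. The only
library call used is codecs.lookup(), which raises LookupError for
unknown encodings.

```python
import codecs


def guess_default_encoding(lang):
    """Return the encoding named in a LANG-style value, or 'ascii'.

    Hypothetical helper: mimics the documented fallback, not site.py itself.
    """
    if not lang:
        # LANG unset or empty: encoding cannot be determined.
        return "ascii"

    # Assumption: the encoding follows the first "." or ":" separator,
    # e.g. "de_DE.utf8" or the "de_DE:utf8" form used in this message.
    candidate = None
    for sep in (".", ":"):
        if sep in lang:
            candidate = lang.split(sep, 1)[1]
            break
    if not candidate:
        return "ascii"

    # Drop a trailing modifier such as "@euro" if present.
    candidate = candidate.split("@", 1)[0]
    if not candidate:
        return "ascii"

    try:
        # Only accept encodings this Python installation supports.
        codecs.lookup(candidate)
    except LookupError:
        return "ascii"
    return candidate
```

With this sketch, a value such as "de_DE:latin1" yields 'latin1', while
a missing value or an unsupported encoding name yields 'ascii'.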

Example:

cnri/Python+Unicode> setenv LANG de_DE:utf8
cnri/Python+Unicode> ./python 
>>> import sys
>>> sys.getdefaultencoding()
'utf8'
>>> print u"äöü"
äöü
>>> 
cnri/Python+Unicode> setenv LANG de_DE:latin1
cnri/Python+Unicode> ./python
>>> import sys
>>> sys.getdefaultencoding()
'latin1'
>>> print u"äöü"
äöü
>>>
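
To make the effect of the encoding choice more concrete, here is a
small sketch that converts the same three characters explicitly,
instead of relying on the default. The helper encode_with() is a
hypothetical name used only for this illustration; it returns None
where the conversion is not possible, which is what happens for
non-ASCII characters under the 'ascii' default.

```python
def encode_with(text, encoding):
    """Encode text with the given encoding; return None if it cannot be encoded."""
    try:
        return text.encode(encoding)
    except UnicodeError:
        return None


# The characters from the example above, written as escapes so the
# snippet does not depend on the encoding of the source file.
sample = u"\xe4\xf6\xfc"

utf8_bytes = encode_with(sample, "utf8")      # two bytes per character
latin1_bytes = encode_with(sample, "latin1")  # one byte per character
ascii_bytes = encode_with(sample, "ascii")    # not representable in ASCII
```

The UTF-8 result is six bytes long and the Latin-1 result three, so the
terminal has to be set up for the same encoding for the output to look
right.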

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/