[Patches] Unicode patch set 2000-06-02

M.-A. Lemburg mal@lemburg.com
Fri, 02 Jun 2000 14:57:06 +0200


This patch set moves the Unicode implementation in the direction
proposed in the recent default encoding discussions on python-dev.

Discussion:
-----------

The patch implements the new strategy of assuming ASCII as the
default encoding whenever 8-bit strings and Unicode meet.
Strings are assumed to be encoded in 7-bit ASCII when being
coerced to Unicode. Characters having the top bit set will
cause an exception to be raised.
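
As a rough illustration, written in present-day Python rather than the
Python version this patch targets, the 7-bit rule behaves like a strict
ASCII decode. The helper name coerce_to_unicode is hypothetical and is
not part of the patch.

```python
def coerce_to_unicode(data, encoding="ascii"):
    """Decode an 8-bit string, failing on bytes the encoding cannot represent."""
    # Strict error handling: with "ascii", any byte that has the top bit
    # set raises UnicodeDecodeError (a subclass of UnicodeError).
    return data.decode(encoding, "strict")


if __name__ == "__main__":
    print(coerce_to_unicode(b"abc"))
    try:
        coerce_to_unicode(b"caf\xe9")
    except UnicodeError as exc:
        print("coercion failed:", exc)
```

Pure ASCII input passes through unchanged; anything else has to be
decoded explicitly by the caller with the right encoding.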

To enhance flexibility and provide better means of customization,
the patch also makes it possible to derive the default encoding
from the locale currently set on the machine running Python.

The locale.py module was extended with a locale aliasing engine
which not only knows about many commonly used locale names, but
also defines default encodings for these. site.py now sets the
default encoding according to the values obtained from the
locale.py get_default() API. If no locale is set or the locale
is unknown, site.py will again default to ASCII.
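
The lookup-with-fallback behaviour described above might be sketched as
follows, in present-day Python. This is an assumption-laden
illustration, not the site.py code: the function name
guess_default_encoding and the two-entry table are hypothetical, and
only the environment variable search order is taken from get_default()
in the patch.

```python
import os

# Hypothetical stand-in for the alias tables in locale.py; the real
# tables are far larger.
LOCALE_ENCODINGS = {"de_de": "ISO8859-1", "ja_jp": "eucJP"}


def guess_default_encoding(environ=None,
                           envvars=("LANGUAGE", "LC_ALL", "LC_CTYPE", "LANG")):
    """Return an encoding name for the first locale variable that is set."""
    environ = os.environ if environ is None else environ
    for name in envvars:
        value = environ.get(name)
        if value is not None:
            break
    else:
        # No locale variable is set at all: fall back to ASCII.
        return "ascii"
    # Strip any explicit encoding suffix and look up the language part.
    language = value.lower().split(".")[0]
    # Unknown locales also fall back to ASCII.
    return LOCALE_ENCODINGS.get(language, "ascii")


if __name__ == "__main__":
    print(guess_default_encoding({"LANG": "de_DE"}))
    print(guess_default_encoding({}))
```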

All of the above is in accordance with Guido's proposal to
choose ASCII as default encoding and to complement this default
with locale awareness. Note that the default encoding is a global
setting, not a per-thread one; it may only be modified in site.py.

Patch Set Contents:
-------------------

Modules/Setup.in:

The locale module is now enabled by default.

Objects/unicodeobject.c:

Change the default encoding to 'ascii' (it was previously
defined as UTF-8).

Note: The implementation still uses UTF-8 to implement
the buffer protocol, so C APIs will still see UTF-8. This
is on purpose: rather than fixing the Unicode implementation,
the C APIs should be made Unicode aware.

Python/sysmodule.c:

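Renamed the APIs for getting and setting the default encoding
from sys.get_string_encoding() and sys.set_string_encoding() to
sys.getdefaultencoding() and sys.setdefaultencoding(). The names
are now in line with those of the other hook APIs (no underscores).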

Lib/encodings/aliases.py:

Added some more codec aliases. Some of them are needed by the
new locale.py encoding support.

Lib/encodings/undefined.py:

New codec which always raises an exception when used. This
codec can be used to effectively switch off string coercion
to Unicode.

Lib/locale.py:

New locale name aliasing engine by Marc-André Lemburg
(mal@lemburg.com). The engine also supports specifying
locale encodings, a feature which is used by the new
default encoding support in site.py.
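
The two-step lookup used by the aliasing engine (full name first, then
the bare language name with an optional encoding alias) can be sketched
like this in present-day Python. The function name normalize_sketch and
the tiny tables are hypothetical; the entries are copied from the
tables in the patch, and edge cases are handled more loosely than in
normalize() itself.

```python
# Small excerpts of the locale_alias and encoding_alias tables.
LOCALE_ALIAS = {
    "c": "C",
    "de": "de_DE.ISO8859-1",
    "german": "de_DE.ISO8859-1",
    "ja_jp.sjis": "ja_JP.SJIS",
}
ENCODING_ALIAS = {"utf8": "utf", "iso88591": "ISO8859-1"}


def normalize_sketch(localename):
    """Map a loosely written locale name to a setlocale()-style code."""
    # ':' is sometimes used as the encoding delimiter.
    fullname = localename.lower().replace(":", ".")
    langname, _, encoding = fullname.partition(".")
    encoding = encoding.split(".")[0]
    fullname = langname + "." + encoding if encoding else langname

    # First lookup: the full name, possibly including an encoding.
    if fullname in LOCALE_ALIAS:
        return LOCALE_ALIAS[fullname]

    # Second lookup: the language name alone.
    code = LOCALE_ALIAS.get(langname)
    if code is None:
        return localename  # normalization failed: return input unchanged
    language, _, default_encoding = code.partition(".")
    if encoding:
        encoding = ENCODING_ALIAS.get(encoding, encoding)
    else:
        encoding = default_encoding
    return language + "." + encoding if encoding else language


if __name__ == "__main__":
    for name in ("German", "de.UTF8", "ja_JP:sjis", "xx"):
        print(name, "->", normalize_sketch(name))
```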

Lib/site.py:

Added support for setting the default encoding of strings
at startup time according to the locale determined by the
locale.py get_default() API.

The sys.setdefaultencoding() API is deleted after having
set up the encoding, so that user code cannot subsequently
change the setting. This effectively means that only site.py
may alter the default setting.
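
The "set once, then remove the setter" pattern can be sketched as
follows in present-day Python. Because sys.setdefaultencoding() no
longer exists in current Python, the sketch uses a hypothetical
stand-in object (make_fake_sys) instead of the real sys module; the
function site_startup is likewise hypothetical and is not the site.py
code from this patch.

```python
import types


def make_fake_sys():
    """Build a stand-in for the sys module with the two encoding hooks."""
    state = {"encoding": "ascii"}
    module = types.SimpleNamespace()
    module.getdefaultencoding = lambda: state["encoding"]

    def setdefaultencoding(name):
        state["encoding"] = name

    module.setdefaultencoding = setdefaultencoding
    return module


def site_startup(sys_module, encoding):
    """Set the default encoding once, then remove the setter."""
    sys_module.setdefaultencoding(encoding)
    # After this, user code has no name through which to change the setting.
    del sys_module.setdefaultencoding


if __name__ == "__main__":
    fake_sys = make_fake_sys()
    site_startup(fake_sys, "latin-1")
    print(fake_sys.getdefaultencoding())
    print(hasattr(fake_sys, "setdefaultencoding"))
```

The getter stays available, so code can still inspect the setting
without being able to alter it.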

_____________________________________________________________________
License Transfer:

I confirm that, to the best of my knowledge and belief, this
contribution is free of any claims of third parties under copyright,
patent or other rights or interests ("claims").  To the extent that I
have any such claims, I hereby grant to CNRI a nonexclusive,
irrevocable, royalty-free, worldwide license to reproduce, distribute,
perform and/or display publicly, prepare derivative versions, and
otherwise use this contribution as part of the Python software and its
related documentation, or any derivative versions thereof, at no cost
to CNRI or its licensed users, and to authorize others to do so.

I acknowledge that CNRI may, at its sole discretion, decide whether or
not to incorporate this contribution in the Python software and its
related documentation.  I further grant CNRI permission to use my name
and other identifying information provided to CNRI by me for use in
connection with the Python software and its related documentation.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/
--------------BEC51DD4A0A70FF5B2F902AD
Content-Type: text/plain; charset=us-ascii;
 name="Unicode-Implementation-2000-06-02.patch"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline;
 filename="Unicode-Implementation-2000-06-02.patch"

Only in CVS-Python: .cvsignore
diff -u -rbP -x *.o -x *.pyc -x Makefile -x *~ -x *.so -x add2lib -x pgen -x buildno -x core -x config.* -x libpython* -x python -x Setup -x Setup.local -x Setup.thread -x hassignal -x Makefile.pre -x configure -x *.bak -x *.s -x DEADJOE -x *.rej -x *.orig -x Demo -x CVS -x Doc -x *.orig -x .#* -x distutils -x PCbuild -x *.py -x ACKS -x *.txt -x README CVS-Python/Modules/Setup.in Python+Unicode/Modules/Setup.in
--- CVS-Python/Modules/Setup.in	Thu May  4 00:34:12 2000
+++ Python+Unicode/Modules/Setup.in	Sat May 27 18:21:46 2000
@@ -140,7 +140,7 @@
 unicodedata unicodedata.c unicodedatabase.c
                         # static Unicode character database
 
-#_locale _localemodule.c  # access to ISO C locale support
+_locale _localemodule.c  # access to ISO C locale support
 
 
 # Modules with some UNIX dependencies -- on by default:
diff -u -rbP -x *.o -x *.pyc -x Makefile -x *~ -x *.so -x add2lib -x pgen -x buildno -x core -x config.* -x libpython* -x python -x Setup -x Setup.local -x Setup.thread -x hassignal -x Makefile.pre -x configure -x *.bak -x *.s -x DEADJOE -x *.rej -x *.orig -x Demo -x CVS -x Doc -x *.orig -x .#* -x distutils -x PCbuild -x *.py -x ACKS -x *.txt -x README CVS-Python/Objects/unicodeobject.c Python+Unicode/Objects/unicodeobject.c
--- CVS-Python/Objects/unicodeobject.c	Tue May  9 21:54:43 2000
+++ Python+Unicode/Objects/unicodeobject.c	Fri Jun  2 13:54:42 2000
@@ -4710,7 +4710,7 @@
 
     /* Init the implementation */
     unicode_empty = _PyUnicode_New(0);
-    strcpy(unicode_default_encoding, "utf-8");
+    strcpy(unicode_default_encoding, "ascii");
 }
 
 /* Finalize the Unicode implementation */
diff -u -rbP -x *.o -x *.pyc -x Makefile -x *~ -x *.so -x add2lib -x pgen -x buildno -x core -x config.* -x libpython* -x python -x Setup -x Setup.local -x Setup.thread -x hassignal -x Makefile.pre -x configure -x *.bak -x *.s -x DEADJOE -x *.rej -x *.orig -x Demo -x CVS -x Doc -x *.orig -x .#* -x distutils -x PCbuild -x *.py -x ACKS -x *.txt -x README CVS-Python/Python/sysmodule.c Python+Unicode/Python/sysmodule.c
--- CVS-Python/Python/sysmodule.c	Tue May  9 21:57:01 2000
+++ Python+Unicode/Python/sysmodule.c	Fri Jun  2 13:56:18 2000
@@ -143,28 +143,28 @@
 exit status will be one (i.e., failure).";
 
 static PyObject *
-sys_get_string_encoding(self, args)
+sys_getdefaultencoding(self, args)
 	PyObject *self;
 	PyObject *args;
 {
-	if (!PyArg_ParseTuple(args, ":get_string_encoding"))
+	if (!PyArg_ParseTuple(args, ":getdefaultencoding"))
 		return NULL;
 	return PyString_FromString(PyUnicode_GetDefaultEncoding());
 }
 
-static char get_string_encoding_doc[] =
-"get_string_encoding() -> string\n\
+static char getdefaultencoding_doc[] =
+"getdefaultencoding() -> string\n\
 \n\
 Return the current default string encoding used by the Unicode \n\
 implementation.";
 
 static PyObject *
-sys_set_string_encoding(self, args)
+sys_setdefaultencoding(self, args)
 	PyObject *self;
 	PyObject *args;
 {
 	char *encoding;
-	if (!PyArg_ParseTuple(args, "s:set_string_encoding", &encoding))
+	if (!PyArg_ParseTuple(args, "s:setdefaultencoding", &encoding))
 		return NULL;
 	if (PyUnicode_SetDefaultEncoding(encoding))
 	    	return NULL;
@@ -172,8 +172,8 @@
 	return Py_None;
 }
 
-static char set_string_encoding_doc[] =
-"set_string_encoding(encoding)\n\
+static char setdefaultencoding_doc[] =
+"setdefaultencoding(encoding)\n\
 \n\
 Set the current default string encoding used by the Unicode implementation.";
 
@@ -301,7 +301,7 @@
 	/* Might as well keep this in alphabetic order */
 	{"exc_info",	sys_exc_info, 1, exc_info_doc},
 	{"exit",	sys_exit, 0, exit_doc},
-	{"get_string_encoding", sys_get_string_encoding, 1, get_string_encoding_doc},
+	{"getdefaultencoding", sys_getdefaultencoding, 1, getdefaultencoding_doc},
 #ifdef COUNT_ALLOCS
 	{"getcounts",	sys_getcounts, 1},
 #endif
@@ -315,7 +315,7 @@
 #ifdef USE_MALLOPT
 	{"mdebug",	sys_mdebug, 1},
 #endif
-	{"set_string_encoding", sys_set_string_encoding, 1, set_string_encoding_doc},
+	{"setdefaultencoding", sys_setdefaultencoding, 1, setdefaultencoding_doc},
 	{"setcheckinterval",	sys_setcheckinterval, 1, setcheckinterval_doc},
 	{"setprofile",	sys_setprofile, 0, setprofile_doc},
 	{"settrace",	sys_settrace, 0, settrace_doc},
Only in CVS-Python: .cvsignore
diff -u -rP -x *.o -x *.pyc -x Makefile -x *~ -x *.so -x add2lib -x pgen -x buildno -x core -x config.* -x libpython* -x python -x Setup -x Setup.local -x Setup.thread -x hassignal -x Makefile.pre -x configure -x *.bak -x *.s -x DEADJOE -x *.rej -x *.orig -x Demo -x CVS -x Doc -x *.orig -x .#* -x distutils -x PCbuild -x *.c -x *.h -x *.in -x output CVS-Python/Lib/encodings/aliases.py Python+Unicode/Lib/encodings/aliases.py
--- CVS-Python/Lib/encodings/aliases.py	Wed Apr  5 22:11:18 2000
+++ Python+Unicode/Lib/encodings/aliases.py	Tue May 30 19:52:23 2000
@@ -18,6 +18,8 @@
     'utf': 'utf_8',
     'utf8': 'utf_8',
     'u8': 'utf_8',
+    'utf8@ucs2': 'utf_8',
+    'utf8@ucs4': 'utf_8',
     
     # UTF-16
     'utf16': 'utf_16',
@@ -31,6 +33,8 @@
     'us_ascii': 'ascii',
 
     # ISO
+    '8859': 'latin_1',
+    'iso8859': 'latin_1',
     'iso8859_1': 'latin_1',
     'iso_8859_1': 'latin_1',
     'iso_8859_10': 'iso8859_10',
@@ -47,6 +51,7 @@
     'iso_8859_9': 'iso8859_9',
 
     # Mac
+    'maclatin2': 'mac_latin2',
     'maccentraleurope': 'mac_latin2',
     'maccyrillic': 'mac_cyrillic',
     'macgreek': 'mac_greek',
@@ -56,5 +61,22 @@
 
     # MBCS
     'dbcs': 'mbcs',
+
+    # Code pages
+    '437': 'cp437',
+
+    # CJK
+    #
+    # The codecs for these encodings are not distributed with the
+    # Python core, but are included here for reference, since the
+    # locale module relies on having these aliases available.
+    #
+    'jis_7': 'jis_7',
+    'iso_2022_jp': 'jis_7',
+    'ujis': 'euc_jp',
+    'ajec': 'euc_jp',
+    'eucjp': 'euc_jp',
+    'tis620': 'tactis',
+    'sjis': 'shift_jis',
 
 }
diff -u -rP -x *.o -x *.pyc -x Makefile -x *~ -x *.so -x add2lib -x pgen -x buildno -x core -x config.* -x libpython* -x python -x Setup -x Setup.local -x Setup.thread -x hassignal -x Makefile.pre -x configure -x *.bak -x *.s -x DEADJOE -x *.rej -x *.orig -x Demo -x CVS -x Doc -x *.orig -x .#* -x distutils -x PCbuild -x *.c -x *.h -x *.in -x output CVS-Python/Lib/encodings/undefined.py Python+Unicode/Lib/encodings/undefined.py
--- CVS-Python/Lib/encodings/undefined.py	Thu Jan  1 01:00:00 1970
+++ Python+Unicode/Lib/encodings/undefined.py	Sat May 27 18:14:23 2000
@@ -0,0 +1,34 @@
+""" Python 'undefined' Codec
+
+    This codec will always raise a ValueError exception when being
+    used. It is intended for use by the site.py file to switch off
+    automatic string to Unicode coercion.
+
+Written by Marc-Andre Lemburg (mal@lemburg.com).
+
+(c) Copyright CNRI, All Rights Reserved. NO WARRANTY.
+
+"""
+import codecs
+
+### Codec APIs
+
+class Codec(codecs.Codec):
+
+    def encode(self,input,errors='strict'):
+        raise UnicodeError, "undefined encoding"
+
+    def decode(self,input,errors='strict'):
+        raise UnicodeError, "undefined encoding"
+
+class StreamWriter(Codec,codecs.StreamWriter):
+    pass
+        
+class StreamReader(Codec,codecs.StreamReader):
+    pass
+
+### encodings module API
+
+def getregentry():
+
+    return (Codec().encode,Codec().decode,StreamReader,StreamWriter)
diff -u -rP -x *.o -x *.pyc -x Makefile -x *~ -x *.so -x add2lib -x pgen -x buildno -x core -x config.* -x libpython* -x python -x Setup -x Setup.local -x Setup.thread -x hassignal -x Makefile.pre -x configure -x *.bak -x *.s -x DEADJOE -x *.rej -x *.orig -x Demo -x CVS -x Doc -x *.orig -x .#* -x distutils -x PCbuild -x *.c -x *.h -x *.in -x output CVS-Python/Lib/locale.py Python+Unicode/Lib/locale.py
--- CVS-Python/Lib/locale.py	Fri Feb  4 16:39:29 2000
+++ Python+Unicode/Lib/locale.py	Fri Jun  2 14:28:00 2000
@@ -1,10 +1,26 @@
-"""Support for number formatting using the current locale settings."""
+""" Locale support.
 
-# Author: Martin von Loewis
+    The module provides low-level access to the C lib's locale APIs
+    and adds high level number formatting APIs as well as a locale
+    aliasing engine to complement these.
+
+    The aliasing engine includes support for many commonly used locale
+    names and maps them to values suitable for passing to the C lib's
+    setlocale() function. It also includes default encodings for all
+    supported locale names.
+
+"""
 
-from _locale import *
 import string
 
+### C lib locale APIs
+
+from _locale import *
+
+### Number formatting APIs
+
+# Author: Martin von Loewis
+
 #perform the grouping from right to left
 def _group(s):
     conv=localeconv()
@@ -25,7 +41,9 @@
         else:
             result=s[-group:]
         s=s[:-group]
-    if s and result:
+    if not result:
+        return s
+    if s:
         result=s+conv['thousands_sep']+result
     return result
 
@@ -34,7 +52,7 @@
     but takes the current locale into account. 
     Grouping is applied if the third parameter is true."""
     result = f % val
-    fields = string.splitfields(result,".")
+    fields = string.split(result, ".")
     if grouping:
         fields[0]=_group(fields[0])
     if len(fields)==2:
@@ -51,11 +69,15 @@
 def atof(str,func=string.atof):
     "Parses a string as a float according to the locale settings."
     #First, get rid of the grouping
-    s=string.splitfields(str,localeconv()['thousands_sep'])
-    str=string.join(s,"")
+    ts = localeconv()['thousands_sep']
+    if ts:
+        s=string.split(str,ts)
+        str=string.join(s, "")
     #next, replace the decimal point with a dot
-    s=string.splitfields(str,localeconv()['decimal_point'])
-    str=string.join(s,'.')
+    dd = localeconv()['decimal_point']
+    if dd:
+        s=string.split(str,dd)
+        str=string.join(s,'.')
     #finally, parse the string
     return func(str)
 
@@ -63,7 +85,7 @@
     "Converts a string to an integer according to the locale settings."
     return atof(str,string.atoi)
 
-def test():
+def _test():
     setlocale(LC_ALL,"")
     #do grouping
     s1=format("%d",123456789,1)
@@ -71,7 +93,479 @@
     #standard formatting
     s1=str(3.14)
     print s1,"is",atof(s1)
+
+### Locale name aliasing engine
+
+# Author: Marc-Andre Lemburg, mal@lemburg.com
+
+def normalize(localename):
+
+    """ Returns a normalized locale code for the given locale
+        name.
+
+        The returned locale code is formatted for use with
+        setlocale().
+
+        If normalization fails, the original name is returned
+        unchanged.
+
+        If the given encoding is not known, the function defaults to
+        the default encoding for the locale code just like setlocale()
+        does.
+
+    """
+    # Normalize the locale name and extract the encoding
+    fullname = string.lower(localename)
+    if ':' in fullname:
+        # ':' is sometimes used as encoding delimiter.
+        fullname = string.replace(fullname, ':', '.')
+    if '.' in fullname:
+        langname, encoding = string.split(fullname, '.')[:2]
+        fullname = langname + '.' + encoding
+    else:
+        langname = fullname
+        encoding = ''
+
+    # First lookup: fullname (possibly with encoding)
+    code = locale_alias.get(fullname, None)
+    if code is not None:
+        return code
+
+    # Second try: langname (without encoding)
+    code = locale_alias.get(langname, None)
+    if code is not None:
+        if '.' in code:
+            langname, defenc = string.split(code, '.')
+        else:
+            langname = code
+            defenc = ''
+        if encoding:
+            encoding = encoding_alias.get(encoding, encoding)
+        else:
+            encoding = defenc
+        if encoding:
+            return langname + '.' + encoding
+        else:
+            return langname
+
+    else:
+        return localename
+
+def _parse_localename(localename):
+
+    """ Parses the locale code for localename and returns the
+        result as tuple (language code, encoding).
+
+        The localename is normalized and passed through the locale
+        alias engine. A ValueError is raised in case the locale name
+        cannot be parsed.
+
+        The language code corresponds to RFC 1766.  code and encoding
+        can be None in case the values cannot be determined or are
+        unknown to this implementation.
+
+    """
+    code = normalize(localename)
+    if '.' in code:
+        return string.split(code, '.')[:2]
+    elif code == 'C':
+        return None, None
+    else:
+        raise ValueError,'unknown locale: %s' % localename
+    return l
+
+def _build_localename(localetuple):
+
+    """ Builds a locale code from the given tuple (language code,
+        encoding).
+
+        No aliasing or normalizing takes place.
+
+    """
+    language, encoding = localetuple
+    if language is None:
+        language = 'C'
+    if encoding is None:
+        return language
+    else:
+        return language + '.' + encoding
+    
+def get_default(envvars=('LANGUAGE', 'LC_ALL', 'LC_CTYPE', 'LANG')):
+
+    """ Tries to determine the default locale settings and returns
+        them as tuple (language code, encoding).
+
+        According to POSIX, a program which has not called
+        setlocale(LC_ALL,"") runs using the portable 'C' locale.
+        Calling setlocale(LC_ALL,"") lets it use the default locale as
+        defined by the LANG variable. Since we don't want to interfere
+        with the current locale setting we thus emulate the behaviour
+        in the way described above.
+
+        To maintain compatibility with other platforms, not only the
+        LANG variable is tested, but a list of variables given as
+        envvars parameter. The first found to be defined will be
+        used. envvars defaults to the search path used in GNU gettext;
+        it must always contain the variable name 'LANG'.
+
+        Except for the code 'C', the language code corresponds to RFC
+        1766.  code and encoding can be None in case the values cannot
+        be determined.
+
+    """
+    import os
+    lookup = os.environ.get
+    for variable in envvars:
+        localename = lookup(variable,None)
+        if localename is not None:
+            break
+    else:
+        localename = 'C'
+    return _parse_localename(localename)
+
+def get_locale(category=LC_CTYPE):
+
+    """ Returns the current setting for the given locale category as
+        tuple (language code, encoding).
+
+        category may be one of the LC_* values except LC_ALL. It
+        defaults to LC_CTYPE.
+
+        Except for the code 'C', the language code corresponds to RFC
+        1766.  code and encoding can be None in case the values cannot
+        be determined.
+
+    """
+    localename = setlocale(category)
+    if category == LC_ALL and ';' in localename:
+        raise TypeError,'category LC_ALL is not supported'
+    return _parse_localename(localename)
+
+def set_locale(localetuple, category=LC_ALL):
+
+    """ Set the locale according to the localetuple (language code,
+        encoding) as returned by get_locale() and get_default().
+
+        The given codes are passed through the locale aliasing engine
+        before being given to setlocale() for processing.
+
+        category may be given as one of the LC_* values. It defaults
+        to LC_ALL.
+
+    """
+    setlocale(category, normalize(_build_localename(localetuple)))
+
+def set_to_default(category=LC_ALL):
+
+    """ Sets the locale for category to the default setting.
+
+        The default setting is determined by calling
+        get_default(). category defaults to LC_ALL.
+        
+    """
+    setlocale(category, _build_localename(get_default()))
+
+### Database
+#
+# The following data was extracted from the locale.alias file which
+# comes with X11 and then hand edited removing the explicit encoding
+# definitions and adding some more aliases. The file is usually
+# available as /usr/lib/X11/locale/locale.alias.
+#    
+
+#
+# The encoding_alias table maps lowercase encoding alias names to C
+# locale encoding names (case-sensitive).
+#
+encoding_alias = {
+        '437': 				'C',
+        'c': 				'C',
+        'iso8859': 			'ISO8859-1',
+        '8859': 			'ISO8859-1',
+        '88591': 			'ISO8859-1',
+        'ascii': 			'ISO8859-1',
+        'en': 				'ISO8859-1',
+        'iso88591': 			'ISO8859-1',
+        'iso_8859-1': 			'ISO8859-1',
+        '885915': 			'ISO8859-15',
+        'iso885915': 			'ISO8859-15',
+        'iso_8859-15': 			'ISO8859-15',
+        'iso8859-2': 			'ISO8859-2',
+        'iso88592': 			'ISO8859-2',
+        'iso_8859-2': 			'ISO8859-2',
+        'iso88595': 			'ISO8859-5',
+        'iso88596': 			'ISO8859-6',
+        'iso88597': 			'ISO8859-7',
+        'iso88598': 			'ISO8859-8',
+        'iso88599': 			'ISO8859-9',
+        'iso-2022-jp': 			'JIS7',
+        'jis': 				'JIS7',
+        'jis7': 			'JIS7',
+        'sjis': 			'SJIS',
+        'tis620': 			'TACTIS',
+        'ajec': 			'eucJP',
+        'eucjp': 			'eucJP',
+        'ujis': 			'eucJP',
+        'utf-8': 			'utf',
+        'utf8': 			'utf',
+        'utf8@ucs4': 			'utf',
+}
+
+#    
+# The locale_alias table maps lowercase alias names to C locale names
+# (case-sensitive). Encodings are always separated from the locale
+# name using a dot ('.'); they should only be given in case the
+# language name is needed to interpret the given encoding alias
+# correctly (CJK codes often have this need).
+#
+locale_alias = {
+        'american':                      'en_US.ISO8859-1',
+        'ar':                            'ar_AA.ISO8859-6',
+        'ar_aa':                         'ar_AA.ISO8859-6',
+        'ar_sa':                         'ar_SA.ISO8859-6',
+        'arabic':                        'ar_AA.ISO8859-6',
+        'bg':                            'bg_BG.ISO8859-5',
+        'bg_bg':                         'bg_BG.ISO8859-5',
+        'bulgarian':                     'bg_BG.ISO8859-5',
+        'c-french':                      'fr_CA.ISO8859-1',
+        'c':                             'C',
+        'c_c':                           'C',
+        'cextend':                       'en_US.ISO8859-1',
+        'chinese-s':                     'zh_CN.eucCN',
+        'chinese-t':                     'zh_TW.eucTW',
+        'croatian':                      'hr_HR.ISO8859-2',
+        'cs':                            'cs_CZ.ISO8859-2',
+        'cs_cs':                         'cs_CZ.ISO8859-2',
+        'cs_cz':                         'cs_CZ.ISO8859-2',
+        'cz':                            'cz_CZ.ISO8859-2',
+        'cz_cz':                         'cz_CZ.ISO8859-2',
+        'czech':                         'cs_CS.ISO8859-2',
+        'da':                            'da_DK.ISO8859-1',
+        'da_dk':                         'da_DK.ISO8859-1',
+        'danish':                        'da_DK.ISO8859-1',
+        'de':                            'de_DE.ISO8859-1',
+        'de_at':                         'de_AT.ISO8859-1',
+        'de_ch':                         'de_CH.ISO8859-1',
+        'de_de':                         'de_DE.ISO8859-1',
+        'dutch':                         'nl_BE.ISO8859-1',
+        'ee':                            'ee_EE.ISO8859-4',
+        'el':                            'el_GR.ISO8859-7',
+        'el_gr':                         'el_GR.ISO8859-7',
+        'en':                            'en_US.ISO8859-1',
+        'en_au':                         'en_AU.ISO8859-1',
+        'en_ca':                         'en_CA.ISO8859-1',
+        'en_gb':                         'en_GB.ISO8859-1',
+        'en_ie':                         'en_IE.ISO8859-1',
+        'en_nz':                         'en_NZ.ISO8859-1',
+        'en_uk':                         'en_GB.ISO8859-1',
+        'en_us':                         'en_US.ISO8859-1',
+        'eng_gb':                        'en_GB.ISO8859-1',
+        'english':                       'en_EN.ISO8859-1',
+        'english_uk':                    'en_GB.ISO8859-1',
+        'english_united-states':         'en_US.ISO8859-1',
+        'english_us':                    'en_US.ISO8859-1',
+        'es':                            'es_ES.ISO8859-1',
+        'es_ar':                         'es_AR.ISO8859-1',
+        'es_bo':                         'es_BO.ISO8859-1',
+        'es_cl':                         'es_CL.ISO8859-1',
+        'es_co':                         'es_CO.ISO8859-1',
+        'es_cr':                         'es_CR.ISO8859-1',
+        'es_ec':                         'es_EC.ISO8859-1',
+        'es_es':                         'es_ES.ISO8859-1',
+        'es_gt':                         'es_GT.ISO8859-1',
+        'es_mx':                         'es_MX.ISO8859-1',
+        'es_ni':                         'es_NI.ISO8859-1',
+        'es_pa':                         'es_PA.ISO8859-1',
+        'es_pe':                         'es_PE.ISO8859-1',
+        'es_py':                         'es_PY.ISO8859-1',
+        'es_sv':                         'es_SV.ISO8859-1',
+        'es_uy':                         'es_UY.ISO8859-1',
+        'es_ve':                         'es_VE.ISO8859-1',
+        'et':                            'et_EE.ISO8859-4',
+        'et_ee':                         'et_EE.ISO8859-4',
+        'fi':                            'fi_FI.ISO8859-1',
+        'fi_fi':                         'fi_FI.ISO8859-1',
+        'finnish':                       'fi_FI.ISO8859-1',
+        'fr':                            'fr_FR.ISO8859-1',
+        'fr_be':                         'fr_BE.ISO8859-1',
+        'fr_ca':                         'fr_CA.ISO8859-1',
+        'fr_ch':                         'fr_CH.ISO8859-1',
+        'fr_fr':                         'fr_FR.ISO8859-1',
+        'fre_fr':                        'fr_FR.ISO8859-1',
+        'french':                        'fr_FR.ISO8859-1',
+        'french_france':                 'fr_FR.ISO8859-1',
+        'ger_de':                        'de_DE.ISO8859-1',
+        'german':                        'de_DE.ISO8859-1',
+        'german_germany':                'de_DE.ISO8859-1',
+        'greek':                         'el_GR.ISO8859-7',
+        'hebrew':                        'iw_IL.ISO8859-8',
+        'hr':                            'hr_HR.ISO8859-2',
+        'hr_hr':                         'hr_HR.ISO8859-2',
+        'hu':                            'hu_HU.ISO8859-2',
+        'hu_hu':                         'hu_HU.ISO8859-2',
+        'hungarian':                     'hu_HU.ISO8859-2',
+        'icelandic':                     'is_IS.ISO8859-1',
+        'id':                            'id_ID.ISO8859-1',
+        'id_id':                         'id_ID.ISO8859-1',
+        'is':                            'is_IS.ISO8859-1',
+        'is_is':                         'is_IS.ISO8859-1',
+        'iso-8859-1':                    'en_US.ISO8859-1',
+        'iso-8859-15':                   'en_US.ISO8859-15',
+        'iso8859-1':                     'en_US.ISO8859-1',
+        'iso8859-15':                    'en_US.ISO8859-15',
+        'iso_8859_1':                    'en_US.ISO8859-1',
+        'iso_8859_15':                   'en_US.ISO8859-15',
+        'it':                            'it_IT.ISO8859-1',
+        'it_ch':                         'it_CH.ISO8859-1',
+        'it_it':                         'it_IT.ISO8859-1',
+        'italian':                       'it_IT.ISO8859-1',
+        'iw':                            'iw_IL.ISO8859-8',
+        'iw_il':                         'iw_IL.ISO8859-8',
+        'ja':                            'ja_JP.eucJP',
+        'ja.jis':                        'ja_JP.JIS7',
+        'ja.sjis':                       'ja_JP.SJIS',
+        'ja_jp':                         'ja_JP.eucJP',
+        'ja_jp.ajec':                    'ja_JP.eucJP',
+        'ja_jp.euc':                     'ja_JP.eucJP',
+        'ja_jp.eucjp':                   'ja_JP.eucJP',
+        'ja_jp.iso-2022-jp':             'ja_JP.JIS7',
+        'ja_jp.jis':                     'ja_JP.JIS7',
+        'ja_jp.jis7':                    'ja_JP.JIS7',
+        'ja_jp.mscode':                  'ja_JP.SJIS',
+        'ja_jp.sjis':                    'ja_JP.SJIS',
+        'ja_jp.ujis':                    'ja_JP.eucJP',
+        'japan':                         'ja_JP.eucJP',
+        'japanese':                      'ja_JP.SJIS',
+        'japanese-euc':                  'ja_JP.eucJP',
+        'japanese.euc':                  'ja_JP.eucJP',
+        'jp_jp':                         'ja_JP.eucJP',
+        'ko':                            'ko_KR.eucKR',
+        'ko_kr':                         'ko_KR.eucKR',
+        'ko_kr.euc':                     'ko_KR.eucKR',
+        'korean':                        'ko_KR.eucKR',
+        'lt':                            'lt_LT.ISO8859-4',
+        'lv':                            'lv_LV.ISO8859-4',
+        'mk':                            'mk_MK.ISO8859-5',
+        'mk_mk':                         'mk_MK.ISO8859-5',
+        'nl':                            'nl_NL.ISO8859-1',
+        'nl_be':                         'nl_BE.ISO8859-1',
+        'nl_nl':                         'nl_NL.ISO8859-1',
+        'no':                            'no_NO.ISO8859-1',
+        'no_no':                         'no_NO.ISO8859-1',
+        'norwegian':                     'no_NO.ISO8859-1',
+        'pl':                            'pl_PL.ISO8859-2',
+        'pl_pl':                         'pl_PL.ISO8859-2',
+        'polish':                        'pl_PL.ISO8859-2',
+        'portuguese':                    'pt_PT.ISO8859-1',
+        'portuguese_brazil':             'pt_BR.ISO8859-1',
+        'posix':                         'C',
+        'posix-utf2':                    'C',
+        'pt':                            'pt_PT.ISO8859-1',
+        'pt_br':                         'pt_BR.ISO8859-1',
+        'pt_pt':                         'pt_PT.ISO8859-1',
+        'ro':                            'ro_RO.ISO8859-2',
+        'ro_ro':                         'ro_RO.ISO8859-2',
+        'ru':                            'ru_RU.ISO8859-5',
+        'ru_ru':                         'ru_RU.ISO8859-5',
+        'rumanian':                      'ro_RO.ISO8859-2',
+        'russian':                       'ru_RU.ISO8859-5',
+        'serbocroatian':                 'sh_YU.ISO8859-2',
+        'sh':                            'sh_YU.ISO8859-2',
+        'sh_hr':                         'sh_HR.ISO8859-2',
+        'sh_sp':                         'sh_YU.ISO8859-2',
+        'sh_yu':                         'sh_YU.ISO8859-2',
+        'sk':                            'sk_SK.ISO8859-2',
+        'sk_sk':                         'sk_SK.ISO8859-2',
+        'sl':                            'sl_CS.ISO8859-2',
+        'sl_cs':                         'sl_CS.ISO8859-2',
+        'sl_si':                         'sl_SI.ISO8859-2',
+        'slovak':                        'sk_SK.ISO8859-2',
+        'slovene':                       'sl_CS.ISO8859-2',
+        'sp':                            'sp_YU.ISO8859-5',
+        'sp_yu':                         'sp_YU.ISO8859-5',
+        'spanish':                       'es_ES.ISO8859-1',
+        'spanish_spain':                 'es_ES.ISO8859-1',
+        'sr_sp':                         'sr_SP.ISO8859-2',
+        'sv':                            'sv_SE.ISO8859-1',
+        'sv_se':                         'sv_SE.ISO8859-1',
+        'swedish':                       'sv_SE.ISO8859-1',
+        'th_th':                         'th_TH.TACTIS',
+        'tr':                            'tr_TR.ISO8859-9',
+        'tr_tr':                         'tr_TR.ISO8859-9',
+        'turkish':                       'tr_TR.ISO8859-9',
+        'univ':                          'en_US.utf',
+        'universal':                     'en_US.utf',
+        'zh':                            'zh_CN.eucCN',
+        'zh_cn':                         'zh_CN.eucCN',
+        'zh_cn.big5':                    'zh_TW.eucTW',
+        'zh_cn.euc':                     'zh_CN.eucCN',
+        'zh_tw':                         'zh_TW.eucTW',
+        'zh_tw.euc':                     'zh_TW.eucTW',
+}
+
+def _print_locale():
+
+    """ Test function.
+    """
+    categories = {}
+    def _init_categories(categories=categories):
+        for k,v in globals().items():
+            if k[:3] == 'LC_':
+                categories[k] = v
+    _init_categories()
+    del categories['LC_ALL']
+
+    print 'Locale defaults as determined by get_default():'
+    print '-'*72
+    lang, enc = get_default()
+    print 'Language: ', lang or '(undefined)'
+    print 'Encoding: ', enc or '(undefined)'
+    print
+
+    print 'Locale settings on startup:'
+    print '-'*72
+    for name,category in categories.items():
+        print name,'...'
+        lang, enc = get_locale(category)
+        print '   Language: ', lang or '(undefined)'
+        print '   Encoding: ', enc or '(undefined)'
+        print
+
+    set_to_default()
+    print
+    print 'Locale settings after calling set_to_default():'
+    print '-'*72
+    for name,category in categories.items():
+        print name,'...'
+        lang, enc = get_locale(category)
+        print '   Language: ', lang or '(undefined)'
+        print '   Encoding: ', enc or '(undefined)'
+        print
+    
+    try:
+        setlocale(LC_ALL,"")
+    except:
+        print 'NOTE:'
+        print 'setlocale(LC_ALL,"") does not support the default locale'
+        print 'given in the OS environment variables.'
+    else:
+        print
+        print 'Locale settings after calling setlocale(LC_ALL,""):'
+        print '-'*72
+        for name,category in categories.items():
+            print name,'...'
+            lang, enc = get_locale(category)
+            print '   Language: ', lang or '(undefined)'
+            print '   Encoding: ', enc or '(undefined)'
+            print
     
+###
 
 if __name__=='__main__':
-    test()
+    print 'Locale aliasing:'
+    print
+    _print_locale()
+    print
+    print 'Number formatting:'
+    print
+    _test()
diff -u -rP -x *.o -x *.pyc -x Makefile -x *~ -x *.so -x add2lib -x pgen -x buildno -x core -x config.* -x libpython* -x python -x Setup -x Setup.local -x Setup.thread -x hassignal -x Makefile.pre -x configure -x *.bak -x *.s -x DEADJOE -x *.rej -x *.orig -x Demo -x CVS -x Doc -x *.orig -x .#* -x distutils -x PCbuild -x *.c -x *.h -x *.in -x output CVS-Python/Lib/site.py Python+Unicode/Lib/site.py
--- CVS-Python/Lib/site.py	Wed Nov 25 16:57:47 1998
+++ Python+Unicode/Lib/site.py	Fri Jun  2 14:53:59 2000
@@ -119,10 +119,45 @@
 __builtin__.quit = __builtin__.exit = exit
 del exit
 
+#
+# Set the string encoding used by the Unicode implementation to the
+# encoding used by the default locale of this system. If the default
+# encoding cannot be determined or is unknown, it defaults to 'ascii'.
+#
+def locale_aware_defaultencoding():
+    import locale
+    code, encoding = locale.get_default()
+    if encoding is None:
+        encoding = 'ascii'
+    try:
+        sys.setdefaultencoding(encoding)
+    except LookupError:
+        sys.setdefaultencoding('ascii')
+
+if 1:
+    # Enable to support locale-aware default string encodings.
+    locale_aware_defaultencoding()
+elif 0:
+    # Enable to switch off string-to-Unicode coercion and implicit
+    # Unicode-to-string conversion.
+    sys.setdefaultencoding('undefined')
+elif 0:
+    # Enable to hard-code a site-specific default string encoding.
+    sys.setdefaultencoding('ascii')
+
+#
+# Run custom site-specific code, if available.
+#
 try:
-    import sitecustomize                # Run arbitrary site specific code
+    import sitecustomize
 except ImportError:
-    pass                                # No site customization module
+    pass
+
+#
+# Remove sys.setdefaultencoding() so that users cannot change the
+# encoding after initialization.
+#
+del sys.setdefaultencoding
 
 def _test():
     print "sys.path = ["

--------------BEC51DD4A0A70FF5B2F902AD--
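
For readers who want to try the fallback order used by
locale_aware_defaultencoding() in the site.py hunk above without
patching an interpreter, here is a minimal sketch. It is not part of
the patch set. The helper name choose_default_encoding() is
hypothetical, and it assumes that asking codecs.lookup() whether a
codec exists is a fair stand-in for the LookupError that
sys.setdefaultencoding() raises for an unknown encoding. It only
returns the chosen name and never changes interpreter state.

```python
import codecs


def choose_default_encoding(locale_encoding):
    """Return a usable codec name, falling back to 'ascii'.

    Hypothetical helper, not part of the patch: it follows the same
    order as locale_aware_defaultencoding() (no encoding -> 'ascii',
    unknown encoding -> 'ascii') but only reports the result.
    """
    if locale_encoding is None:
        # No locale set, or the locale carries no encoding information.
        return 'ascii'
    try:
        # Assumption: codec availability is checked via codecs.lookup(),
        # which raises LookupError for names it cannot resolve.
        codecs.lookup(locale_encoding)
    except LookupError:
        return 'ascii'
    return locale_encoding
```

Passing the encoding part of an alias table entry, for example
'ISO8859-2' from 'pl_PL.ISO8859-2', returns that name unchanged when
a matching codec is installed. Passing None or a name no codec
answers to returns 'ascii'.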