[Patches] Unicode Patch Set 2000-05-09

M.-A. Lemburg mal@lemburg.com
Tue, 09 May 2000 20:59:32 +0200


--------------80712B74A5371BBFE8498331
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit

This patch fixes a few bugglets and adds an experimental
feature which allows setting the string encoding assumed
by the Unicode implementation at run-time.

The current implementation uses a process global for
the string encoding. This should later be changed to a
thread state variable, so that the setting can be made
on a per-thread basis.

Note that only the coercions from strings to Unicode
are affected by the encoding parameter. str(unicode)
and the "s" parser marker still return UTF-8.

The main intent of this patch is to provide a test
bed for the ongoing Unicode debate. For example, to have
the implementation use 'latin-1' as default string encoding,
put

import sys
sys.set_string_encoding('latin-1')

in your site.py file.
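
To illustrate why the assumed string encoding matters, here
is a minimal sketch for a current Python 3 interpreter. It
is not part of the patch: sys.set_string_encoding() is the
experimental API added here and is not available in modern
interpreters, so the hypothetical helper coerce() takes the
assumed default encoding as an explicit argument instead.

```python
# Stand-in for the string-to-Unicode coercion: the same bytes are decoded
# in "strict" mode under two different assumed default encodings.

RAW = b"\xc3\xa4"  # the two bytes UTF-8 uses for U+00E4 (a with diaeresis)


def coerce(raw, default_encoding):
    """Hypothetical helper: decode raw using the assumed default encoding."""
    return raw.decode(default_encoding, "strict")


as_utf8 = coerce(RAW, "utf-8")      # one character: U+00E4
as_latin1 = coerce(RAW, "latin-1")  # two characters: U+00C3, U+00A4

print(len(as_utf8), len(as_latin1))  # prints "1 2"
```

Both decodes succeed, but they produce different text, which
is why the choice of default is worth experimenting with.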


Patch Set Contents:
-------------------

Include/codecs.h:

Added documentation and the missing PyCodec_StreamWriter API.

Include/unicodeobject.h:

Added PyUnicode_GetDefaultEncoding() and
PyUnicode_SetDefaultEncoding() APIs.

Modules/timemodule.c:

Fixed a bug due to a /* inside /*...*/. GCC doesn't like
this and bombs.

Objects/unicodeobject.c:

Added support for user settable default encodings. The
current implementation uses a per-process global which
defines the value of the encoding parameter in case it
is set to NULL (meaning: use the default encoding).

Fixed a core dump in PyUnicode_Format().
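
The settable default described above can be sketched in
Python roughly as follows. This is only an illustration of
the pattern, not the C implementation: the names
get_default_encoding(), set_default_encoding() and decode()
are made up for this example, None plays the role of NULL,
and codecs.lookup() stands in for the registry lookup that
validates the encoding name.

```python
import codecs

# Hypothetical module-level state, mirroring the per-process global.
_default_encoding = "utf-8"


def get_default_encoding():
    """Return the currently active default encoding."""
    return _default_encoding


def set_default_encoding(encoding):
    """Validate the name through the codec registry, then store it."""
    global _default_encoding
    codecs.lookup(encoding)  # raises LookupError for an unknown encoding
    _default_encoding = encoding


def decode(raw, encoding=None, errors="strict"):
    """Decode raw; encoding=None means "use the current default"."""
    if encoding is None:
        encoding = get_default_encoding()
    return raw.decode(encoding, errors)
```

Validating before storing means a failed call leaves the
previous default untouched.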

Python/bltinmodule.c:

Fixed docs according to the new behaviour (the Unicode
encoding is no longer fixed to UTF-8).

Python/codecs.c:

Moved some docs to the include file.

Added a NULL check to _PyCodec_Lookup() to make it
core dump safe.

Python/sysmodule.c:

Added APIs to allow setting and querying the system's
current string encoding: sys.set_string_encoding()
and sys.get_string_encoding().

Lib/test/test_unicode.py:

Added another test for string formatting (the one that
produced the core dump now fixed in unicodeobject.c).
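
A rough Python 3 analogue of what the new test exercises,
under the assumption that the failure comes from coercing
the format string to Unicode before formatting. The helper
format_with_text() is hypothetical and takes the assumed
default encoding as an argument.

```python
FORMAT = b"...%s...\xe4\xf6\xfc..."  # Latin-1 bytes; not valid UTF-8


def format_with_text(fmt, arg, default_encoding="utf-8"):
    """Coerce the byte format string to text first, then format."""
    return fmt.decode(default_encoding, "strict") % arg


try:
    format_with_text(FORMAT, "abc")
except ValueError:
    # UnicodeDecodeError is a subclass of ValueError, so this matches
    # the "except ValueError" used in the test.
    failed = True
else:
    failed = False
```

With 'latin-1' as the assumed default the same call succeeds,
since every byte maps to a character in that encoding.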

Misc/unicode.txt:

Added a useful link to Markus Kuhn's Unicode and UTF-8
FAQ.

_____________________________________________________________________
License Transfer:

I confirm that, to the best of my knowledge and belief, this
contribution is free of any claims of third parties under copyright,
patent or other rights or interests ("claims").  To the extent that I
have any such claims, I hereby grant to CNRI a nonexclusive,
irrevocable, royalty-free, worldwide license to reproduce, distribute,
perform and/or display publicly, prepare derivative versions, and
otherwise use this contribution as part of the Python software and its
related documentation, or any derivative versions thereof, at no cost
to CNRI or its licensed users, and to authorize others to do so.

I acknowledge that CNRI may, at its sole discretion, decide whether or
not to incorporate this contribution in the Python software and its
related documentation.  I further grant CNRI permission to use my name
and other identifying information provided to CNRI by me for use in
connection with the Python software and its related documentation.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/
--------------80712B74A5371BBFE8498331
Content-Type: text/plain; charset=iso-8859-1;
 name="Unicode-Implementation-2000-05-09.patch"
Content-Transfer-Encoding: 8bit
Content-Disposition: inline;
 filename="Unicode-Implementation-2000-05-09.patch"

diff -u -rbP -x *.o -x *.pyc -x Makefile -x *~ -x *.so -x add2lib -x pgen -x buildno -x config.* -x libpython* -x python -x Setup -x Setup.local -x Setup.thread -x hassignal -x Makefile.pre -x configure -x *.bak -x *.s -x DEADJOE -x *.rej -x *.orig -x Demo -x CVS -x Doc -x *.orig -x .#* -x distutils -x PC -x PCbuild -x *.py -x ACKS -x *.txt -x README CVS-Python/Include/codecs.h Python+Unicode/Include/codecs.h
--- CVS-Python/Include/codecs.h	Fri Mar 10 23:32:23 2000
+++ Python+Unicode/Include/codecs.h	Tue May  9 19:38:52 2000
@@ -15,37 +15,105 @@
 
    ------------------------------------------------------------------------ */
 
+/* Register a new codec search function.
+
+   As side effect, this tries to load the encodings package, if not
+   yet done, to make sure that it is always first in the list of
+   search functions.
+
+   The search_function's refcount is incremented by this function. */
+
 extern DL_IMPORT(int) PyCodec_Register(
        PyObject *search_function
        );
 
+/* Codec register lookup API.
+
+   Looks up the given encoding and returns a tuple (encoder, decoder,
+   stream reader, stream writer) of functions which implement the
+   different aspects of processing the encoding.
+
+   The encoding string is looked up converted to all lower-case
+   characters. This makes encodings looked up through this mechanism
+   effectively case-insensitive.
+
+   If no codec is found, a LookupError is set and NULL returned. 
+
+   As side effect, this tries to load the encodings package, if not
+   yet done. This is part of the lazy load strategy for the encodings
+   package.
+
+ */
+
 extern DL_IMPORT(PyObject *) _PyCodec_Lookup(
        const char *encoding
        );
 
+/* Generic codec based encoding API.
+
+   object is passed through the encoder function found for the given
+   encoding using the error handling method defined by errors. errors
+   may be NULL to use the default method defined for the codec.
+   
+   Raises a LookupError in case no encoder can be found.
+
+ */
+
+extern DL_IMPORT(PyObject *) PyCodec_Encode(
+       PyObject *object,
+       const char *encoding,
+       const char *errors
+       );
+
+/* Generic codec based decoding API.
+
+   object is passed through the decoder function found for the given
+   encoding using the error handling method defined by errors. errors
+   may be NULL to use the default method defined for the codec.
+   
+   Raises a LookupError in case no decoder can be found.
+
+ */
+
+extern DL_IMPORT(PyObject *) PyCodec_Decode(
+       PyObject *object,
+       const char *encoding,
+       const char *errors
+       );
+
+/* --- Codec Lookup APIs -------------------------------------------------- 
+
+   All APIs return a codec object with incremented refcount and are
+   based on _PyCodec_Lookup().  The same comments w/r to the encoding
+   name also apply to these APIs.
+
+*/
+
+/* Get an encoder function for the given encoding. */
+
 extern DL_IMPORT(PyObject *) PyCodec_Encoder(
        const char *encoding
        );
 
+/* Get a decoder function for the given encoding. */
+
 extern DL_IMPORT(PyObject *) PyCodec_Decoder(
        const char *encoding
        );
 
+/* Get a StreamReader factory function for the given encoding. */
+
 extern DL_IMPORT(PyObject *) PyCodec_StreamReader(
        const char *encoding,
        PyObject *stream,
        const char *errors
        );
 
-extern DL_IMPORT(PyObject *) PyCodec_Encode(
-       PyObject *object,
-       const char *encoding,
-       const char *errors
-       );
+/* Get a StreamWriter factory function for the given encoding. */
 
-extern DL_IMPORT(PyObject *) PyCodec_Decode(
-       PyObject *object,
+extern DL_IMPORT(PyObject *) PyCodec_StreamWriter(
        const char *encoding,
+       PyObject *stream,
        const char *errors
        );
 
diff -u -rbP -x *.o -x *.pyc -x Makefile -x *~ -x *.so -x add2lib -x pgen -x buildno -x config.* -x libpython* -x python -x Setup -x Setup.local -x Setup.thread -x hassignal -x Makefile.pre -x configure -x *.bak -x *.s -x DEADJOE -x *.rej -x *.orig -x Demo -x CVS -x Doc -x *.orig -x .#* -x distutils -x PC -x PCbuild -x *.py -x ACKS -x *.txt -x README CVS-Python/Include/unicodeobject.h Python+Unicode/Include/unicodeobject.h
--- CVS-Python/Include/unicodeobject.h	Thu Apr 13 11:11:33 2000
+++ Python+Unicode/Include/unicodeobject.h	Tue May  9 19:46:29 2000
@@ -265,8 +265,8 @@
       refcount.
 
    2. String and other char buffer compatible objects are decoded
-      under the assumptions that they contain UTF-8 data. Decoding
-      is done in "strict" mode.
+      under the assumptions that they contain data using the current
+      default encoding. Decoding is done in "strict" mode.
 
    3. All other objects raise an exception.
 
@@ -313,8 +313,7 @@
    parameters encoding and errors have the same semantics as the ones
    of the builtin unicode() API. 
 
-   Setting encoding to NULL causes the default encoding to be used
-   which is UTF-8.
+   Setting encoding to NULL causes the default encoding to be used.
 
    Error handling is set by errors which may also be set to NULL
    meaning to use the default handling defined for the codec. Default
@@ -325,6 +324,29 @@
    generic ones are documented.
 
 */
+
+/* --- Manage the default encoding ---------------------------------------- */
+
+/* Returns the currently active default encoding.
+
+   The default encoding is currently implemented as run-time settable
+   process global.  This may change in future versions of the
+   interpreter to become a parameter which is managed on a per-thread
+   basis.
+   
+ */
+
+extern DL_IMPORT(const char*) PyUnicode_GetDefaultEncoding();
+
+/* Sets the currently active default encoding.
+
+   Returns 0 on success, -1 in case of an error.
+   
+ */
+
+extern DL_IMPORT(int) PyUnicode_SetDefaultEncoding(
+    const char *encoding	/* Encoding name in standard form */
+    );
 
 /* --- Generic Codecs ----------------------------------------------------- */
 
diff -u -rbP -x *.o -x *.pyc -x Makefile -x *~ -x *.so -x add2lib -x pgen -x buildno -x config.* -x libpython* -x python -x Setup -x Setup.local -x Setup.thread -x hassignal -x Makefile.pre -x configure -x *.bak -x *.s -x DEADJOE -x *.rej -x *.orig -x Demo -x CVS -x Doc -x *.orig -x .#* -x distutils -x PC -x PCbuild -x *.py -x ACKS -x *.txt -x README CVS-Python/Modules/timemodule.c Python+Unicode/Modules/timemodule.c
--- CVS-Python/Modules/timemodule.c	Thu Apr 27 21:29:40 2000
+++ Python+Unicode/Modules/timemodule.c	Tue May  9 17:42:03 2000
@@ -421,7 +421,10 @@
 #endif /* HAVE_STRFTIME */
 
 #ifdef HAVE_STRPTIME
-/* extern char *strptime(); /* Enable this if it's not declared in <time.h> */
+
+#if 0
+extern char *strptime(); /* Enable this if it's not declared in <time.h> */
+#endif
 
 static PyObject *
 time_strptime(self, args)
diff -u -rbP -x *.o -x *.pyc -x Makefile -x *~ -x *.so -x add2lib -x pgen -x buildno -x config.* -x libpython* -x python -x Setup -x Setup.local -x Setup.thread -x hassignal -x Makefile.pre -x configure -x *.bak -x *.s -x DEADJOE -x *.rej -x *.orig -x Demo -x CVS -x Doc -x *.orig -x .#* -x distutils -x PC -x PCbuild -x *.py -x ACKS -x *.txt -x README CVS-Python/Objects/unicodeobject.c Python+Unicode/Objects/unicodeobject.c
--- CVS-Python/Objects/unicodeobject.c	Tue May  9 16:14:50 2000
+++ Python+Unicode/Objects/unicodeobject.c	Tue May  9 20:25:42 2000
@@ -117,6 +117,16 @@
 static PyUnicodeObject *unicode_freelist = NULL;
 static int unicode_freelist_size = 0;
 
+/* Default encoding to use and assume when NULL is passed as encoding
+   parameter; it is initialized by _PyUnicode_Init().
+
+   Always use the PyUnicode_SetDefaultEncoding() and
+   PyUnicode_GetDefaultEncoding() APIs to access this global. 
+
+*/
+
+static char unicode_default_encoding[100];
+
 /* --- Unicode Object ----------------------------------------------------- */
 
 static
@@ -366,7 +376,7 @@
 	Py_INCREF(unicode_empty);
 	return (PyObject *)unicode_empty;
     }
-    return PyUnicode_DecodeUTF8(s, len, "strict");
+    return PyUnicode_Decode(s, len, NULL, "strict");
 }
 
 PyObject *PyUnicode_Decode(const char *s,
@@ -376,10 +386,16 @@
 {
     PyObject *buffer = NULL, *unicode;
     
-    /* Shortcut for the default encoding UTF-8 */
-    if (encoding == NULL || 
-        (strcmp(encoding, "utf-8") == 0))
+    if (encoding == NULL) 
+	encoding = PyUnicode_GetDefaultEncoding();
+
+    /* Shortcuts for common default encodings */
+    if (strcmp(encoding, "utf-8") == 0)
         return PyUnicode_DecodeUTF8(s, size, errors);
+    else if (strcmp(encoding, "latin-1") == 0)
+        return PyUnicode_DecodeLatin1(s, size, errors);
+    else if (strcmp(encoding, "ascii") == 0)
+        return PyUnicode_DecodeASCII(s, size, errors);
 
     /* Decode via the codec registry */
     buffer = PyBuffer_FromMemory((void *)s, size);
@@ -428,11 +444,19 @@
         PyErr_BadArgument();
         goto onError;
     }
-    /* Shortcut for the default encoding UTF-8 */
-    if ((encoding == NULL || 
-	 (strcmp(encoding, "utf-8") == 0)) &&
-	errors == NULL)
+
+    if (encoding == NULL) 
+	encoding = PyUnicode_GetDefaultEncoding();
+
+    /* Shortcuts for common default encodings */
+    if (errors == NULL) {
+	if (strcmp(encoding, "utf-8") == 0)
         return PyUnicode_AsUTF8String(unicode);
+	else if (strcmp(encoding, "latin-1") == 0)
+	    return PyUnicode_AsLatin1String(unicode);
+	else if (strcmp(encoding, "ascii") == 0)
+	    return PyUnicode_AsASCIIString(unicode);
+    }
 
     /* Encode via the codec registry */
     v = PyCodec_Encode(unicode, encoding, errors);
@@ -476,6 +500,30 @@
     return -1;
 }
 
+const char *PyUnicode_GetDefaultEncoding()
+{
+    return unicode_default_encoding;
+}
+
+int PyUnicode_SetDefaultEncoding(const char *encoding)
+{
+    PyObject *v;
+    
+    /* Make sure the encoding is valid. As side effect, this also
+       loads the encoding into the codec registry cache. */
+    v = _PyCodec_Lookup(encoding);
+    if (v == NULL)
+	goto onError;
+    Py_DECREF(v);
+    strncpy(unicode_default_encoding,
+	    encoding, 
+	    sizeof(unicode_default_encoding));
+    return 0;
+
+ onError:
+    return -1;
+}
+
 /* --- UTF-8 Codec -------------------------------------------------------- */
 
 static 
@@ -772,7 +820,8 @@
     }
     else {
         PyErr_Format(PyExc_ValueError,
-                     "UTF-16 decoding error; unknown error handling code: %.400s",
+                     "UTF-16 decoding error; "
+		     "unknown error handling code: %.400s",
                      errors);
         return -1;
     }
@@ -3057,10 +3106,10 @@
 static char encode__doc__[] =
 "S.encode([encoding[,errors]]) -> string\n\
 \n\
-Return an encoded string version of S. Default encoding is 'UTF-8'.\n\
-errors may be given to set a different error handling scheme. Default\n\
-is 'strict' meaning that encoding errors raise a ValueError. Other\n\
-possible values are 'ignore' and 'replace'.";
+Return an encoded string version of S. Default encoding is the current\n\
+default string encoding. errors may be given to set a different error\n\
+handling scheme. Default is 'strict' meaning that encoding errors raise\n\
+a ValueError. Other possible values are 'ignore' and 'replace'.";
 
 static PyObject *
 unicode_encode(PyUnicodeObject *self, PyObject *args)
@@ -3816,7 +3865,7 @@
 static
 PyObject *unicode_str(PyUnicodeObject *self)
 {
-    return PyUnicode_AsUTF8String((PyObject *)self);
+    return PyUnicode_AsEncodedString((PyObject *)self, NULL, NULL);
 }
 
 static char strip__doc__[] =
@@ -4246,6 +4295,8 @@
 	return NULL;
     }
     uformat = PyUnicode_FromObject(format);
+    if (uformat == NULL)
+	return NULL;
     fmt = PyUnicode_AS_UNICODE(uformat);
     fmtcnt = PyUnicode_GET_SIZE(uformat);
 
@@ -4322,13 +4373,10 @@
 				    "incomplete format key");
 		    goto onError;
 		}
-		/* keys are converted to strings (using UTF-8) and
+		/* keys are converted to strings using UTF-8 and
 		   then looked up since Python uses strings to hold
 		   variables names etc. in its namespaces and we
-		   wouldn't want to break common idioms.  The
-		   alternative would be using Unicode objects for the
-		   lookup but u"abc" and "abc" have different hash
-		   values (on purpose). */
+		   wouldn't want to break common idioms. */
 		key = PyUnicode_EncodeUTF8(keystart,
 					   keylen,
 					   NULL);
@@ -4472,8 +4520,9 @@
 					"%s argument has non-string str()");
 			goto onError;
 		    }
-		    unicode = PyUnicode_DecodeUTF8(PyString_AS_STRING(temp),
+		    unicode = PyUnicode_Decode(PyString_AS_STRING(temp),
 						   PyString_GET_SIZE(temp),
+					       NULL,
 						   "strict");
 		    Py_DECREF(temp);
 		    temp = unicode;
@@ -4659,7 +4708,9 @@
         Py_FatalError("Unicode configuration error: "
 		      "sizeof(Py_UNICODE) != 2 bytes");
 
+    /* Init the implementation */
     unicode_empty = _PyUnicode_New(0);
+    strcpy(unicode_default_encoding, "utf-8");
 }
 
 /* Finalize the Unicode implementation */
diff -u -rbP -x *.o -x *.pyc -x Makefile -x *~ -x *.so -x add2lib -x pgen -x buildno -x config.* -x libpython* -x python -x Setup -x Setup.local -x Setup.thread -x hassignal -x Makefile.pre -x configure -x *.bak -x *.s -x DEADJOE -x *.rej -x *.orig -x Demo -x CVS -x Doc -x *.orig -x .#* -x distutils -x PC -x PCbuild -x *.py -x ACKS -x *.txt -x README CVS-Python/Python/bltinmodule.c Python+Unicode/Python/bltinmodule.c
--- CVS-Python/Python/bltinmodule.c	Mon May  8 15:13:55 2000
+++ Python+Unicode/Python/bltinmodule.c	Tue May  9 20:10:30 2000
@@ -193,8 +193,8 @@
 "unicode(string [, encoding[, errors]]) -> object\n\
 \n\
 Creates a new Unicode object from the given encoded string.\n\
-encoding defaults to 'utf-8' and errors, defining the error handling,\n\
-to 'strict'.";
+encoding defaults to the current default string encoding and \n\
+errors, defining the error handling, to 'strict'.";
 
 
 static PyObject *
diff -u -rbP -x *.o -x *.pyc -x Makefile -x *~ -x *.so -x add2lib -x pgen -x buildno -x config.* -x libpython* -x python -x Setup -x Setup.local -x Setup.thread -x hassignal -x Makefile.pre -x configure -x *.bak -x *.s -x DEADJOE -x *.rej -x *.orig -x Demo -x CVS -x Doc -x *.orig -x .#* -x distutils -x PC -x PCbuild -x *.py -x ACKS -x *.txt -x README CVS-Python/Python/codecs.c Python+Unicode/Python/codecs.c
--- CVS-Python/Python/codecs.c	Wed Apr  5 22:11:21 2000
+++ Python+Unicode/Python/codecs.c	Tue May  9 19:40:32 2000
@@ -55,14 +55,6 @@
     return 0;
 }
 
-/* Register a new codec search function.
-
-   As side effect, this tries to load the encodings package, if not
-   yet done, to make sure that it is always first in the list of
-   search functions.
-
-   The search_function's refcount is incremented by this function. */
-
 int PyCodec_Register(PyObject *search_function)
 {
     if (!import_encodings_called) {
@@ -117,7 +109,7 @@
    characters. This makes encodings looked up through this mechanism
    effectively case-insensitive.
 
-   If no codec is found, a KeyError is set and NULL returned. 
+   If no codec is found, a LookupError is set and NULL returned. 
 
    As side effect, this tries to load the encodings package, if not
    yet done. This is part of the lazy load strategy for the encodings
@@ -130,6 +122,10 @@
     PyObject *result, *args = NULL, *v;
     int i, len;
 
+    if (encoding == NULL) {
+	PyErr_BadArgument();
+	goto onError;
+    }
     if (_PyCodec_SearchCache == NULL || 
 	_PyCodec_SearchPath == NULL) {
 	PyErr_SetString(PyExc_SystemError,
diff -u -rbP -x *.o -x *.pyc -x Makefile -x *~ -x *.so -x add2lib -x pgen -x buildno -x config.* -x libpython* -x python -x Setup -x Setup.local -x Setup.thread -x hassignal -x Makefile.pre -x configure -x *.bak -x *.s -x DEADJOE -x *.rej -x *.orig -x Demo -x CVS -x Doc -x *.orig -x .#* -x distutils -x PC -x PCbuild -x *.py -x ACKS -x *.txt -x README CVS-Python/Python/sysmodule.c Python+Unicode/Python/sysmodule.c
--- CVS-Python/Python/sysmodule.c	Tue Apr 25 12:34:22 2000
+++ Python+Unicode/Python/sysmodule.c	Tue May  9 20:09:31 2000
@@ -143,6 +143,41 @@
 exit status will be one (i.e., failure).";
 
 static PyObject *
+sys_get_string_encoding(self, args)
+	PyObject *self;
+	PyObject *args;
+{
+	if (!PyArg_ParseTuple(args, ":get_string_encoding"))
+		return NULL;
+	return PyString_FromString(PyUnicode_GetDefaultEncoding());
+}
+
+static char get_string_encoding_doc[] =
+"get_string_encoding() -> string\n\
+\n\
+Return the current default string encoding used by the Unicode \n\
+implementation.";
+
+static PyObject *
+sys_set_string_encoding(self, args)
+	PyObject *self;
+	PyObject *args;
+{
+	char *encoding;
+	if (!PyArg_ParseTuple(args, "s:set_string_encoding", &encoding))
+		return NULL;
+	if (PyUnicode_SetDefaultEncoding(encoding))
+	    	return NULL;
+	Py_INCREF(Py_None);
+	return Py_None;
+}
+
+static char set_string_encoding_doc[] =
+"set_string_encoding(encoding)\n\
+\n\
+Set the current default string encoding used by the Unicode implementation.";
+
+static PyObject *
 sys_settrace(self, args)
 	PyObject *self;
 	PyObject *args;
@@ -266,6 +301,7 @@
 	/* Might as well keep this in alphabetic order */
 	{"exc_info",	sys_exc_info, 1, exc_info_doc},
 	{"exit",	sys_exit, 0, exit_doc},
+	{"get_string_encoding", sys_get_string_encoding, 1, get_string_encoding_doc},
 #ifdef COUNT_ALLOCS
 	{"getcounts",	sys_getcounts, 1},
 #endif
@@ -279,6 +315,7 @@
 #ifdef USE_MALLOPT
 	{"mdebug",	sys_mdebug, 1},
 #endif
+	{"set_string_encoding", sys_set_string_encoding, 1, set_string_encoding_doc},
 	{"setcheckinterval",	sys_setcheckinterval, 1, setcheckinterval_doc},
 	{"setprofile",	sys_setprofile, 0, setprofile_doc},
 	{"settrace",	sys_settrace, 0, settrace_doc},
diff -u -rP -x *.o -x *.pyc -x Makefile -x *~ -x *.so -x add2lib -x pgen -x buildno -x config.* -x libpython* -x python -x Setup -x Setup.local -x Setup.thread -x hassignal -x Makefile.pre -x configure -x *.bak -x *.s -x DEADJOE -x *.rej -x *.orig -x Demo -x CVS -x Doc -x *.orig -x .#* -x distutils -x PC -x PCbuild -x *.c -x *.h -x *.in -x output CVS-Python/Lib/test/test_unicode.py Python+Unicode/Lib/test/test_unicode.py
--- CVS-Python/Lib/test/test_unicode.py	Mon May  1 12:39:38 2000
+++ Python+Unicode/Lib/test/test_unicode.py	Tue May  9 20:25:10 2000
@@ -263,6 +263,12 @@
 assert '...%(foo)s...' % {u'foo':u"abc",u'def':123} == u'...abc...'
 assert '...%s...%s...%s...%s...' % (1,2,3,u"abc") == u'...1...2...3...abc...'
 assert '...%s...' % u"abc" == u'...abc...'
+try:
+    '...%s...äöü...' % u"abc"
+except ValueError:
+    pass
+else:
+    raise AssertionError, "'...%s...äöü...' % u'abc' failed to raise an exception"
 print 'done.'
 
 # Test builtin codecs
diff -u -rP -x *.o -x *.pyc -x Makefile -x *~ -x *.so -x add2lib -x pgen -x buildno -x config.* -x libpython* -x python -x Setup -x Setup.local -x Setup.thread -x hassignal -x Makefile.pre -x configure -x *.bak -x *.s -x DEADJOE -x *.rej -x *.orig -x Demo -x CVS -x Doc -x *.orig -x .#* -x distutils -x PC -x PCbuild -x *.c -x *.h -x *.in -x output CVS-Python/Misc/unicode.txt Python+Unicode/Misc/unicode.txt
--- CVS-Python/Misc/unicode.txt	Thu Apr 13 18:10:59 2000
+++ Python+Unicode/Misc/unicode.txt	Tue May  9 20:28:57 2000
@@ -976,11 +976,14 @@
         http://www.nada.kth.se/i18n/ucs/unicode-iso10646-oview.html
 
 For comparison:
-	Introducing Unicode to ECMAScript --
+	Introducing Unicode to ECMAScript (aka JavaScript) --
 	http://www-4.ibm.com/software/developer/library/internationalization-support.html
 
 IANA Character Set Names:
 	ftp://ftp.isi.edu/in-notes/iana/assignments/character-sets
+
+Discussion of UTF-8 and Unicode support for POSIX and Linux:
+	http://www.cl.cam.ac.uk/~mgk25/unicode.html
 
 Encodings:
 

--------------80712B74A5371BBFE8498331--