[Python-Dev] Unicode debate

Guido van Rossum guido@python.org
Wed, 03 May 2000 11:48:32 -0400


> >Andrew M. Kuchling writes:
> >My suggested criterion is that 1.6 not screw things up in a way that
> >we'll regret when 1.7 rolls around.  UTF-8 probably does back us into
> >a corner that 

> Andrew M. Kuchling writes:
> Doh!  To complete that paragraph: Magic conversions assuming UTF-8
> do back us into a corner that is hard to get out of later.  Magic
> conversions assuming Latin-1 or ASCII are a bit better, but I'd lean
> toward the draconian solution: we don't know what we're doing, so do
> nothing and require the user to explicitly convert between Unicode and
> 8-bit strings in a user-selected encoding.
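
For concreteness, the explicit style Andrew describes would look
roughly like this (a minimal sketch using the unicode() builtin and
the .encode() method from the new Unicode support; the variable
names and encodings are just for illustration):

    data = "caf\xe9"                # 8-bit string holding Latin-1 bytes
    u = unicode(data, "latin-1")    # 8-bit -> Unicode, encoding named by the user
    s = u.encode("utf-8")           # Unicode -> 8-bit, again explicit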

GvR responds:
That's what Ping suggested.  My reason for proposing default
conversions from ASCII is that there is much code that deals with
character strings in a fairly abstract sense and that would work out
of the box (or after very small changes) with Unicode strings.  This
code often uses some string literals containing ASCII characters.  An
arbitrary example: code to reformat a text paragraph; another: an XML
parser.  These look for certain ASCII characters given as literals in
the code (" ", "<" and so on) but the algorithm is essentially
independent of what encoding is used for non-ASCII characters.  (I
realize that the text reformatting example doesn't work for all
Unicode characters because its assumption that all characters have
equal width is broken -- but at the very least it should work with
Latin-1 or Greek or Cyrillic stored in Unicode strings.)
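
Here's a minimal sketch of the kind of code I mean (1.6-style types,
where 8-bit and Unicode strings are distinct; the same function body
serves both because the only literals are ASCII):

    def wrap(text, width):
        # Reformat a paragraph.  The separators " " and "\n" are
        # ASCII literals, but the algorithm never cares how the
        # non-ASCII characters in `text' are represented.
        words = text.split(" ")
        lines, line = [], words[0]
        for word in words[1:]:
            if len(line) + 1 + len(word) <= width:
                line = line + " " + word
            else:
                lines.append(line)
                line = word
        lines.append(line)
        return "\n".join(lines)

wrap() accepts an 8-bit string or a Unicode string; with the default
ASCII conversions the literals mix with either, and the result comes
back in the same string type that went in.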

It's the same as for ints: a function to calculate the GCD works with
ints as well as long ints without change, even though it references
the int constant 0.  In other words, we want string-processing code to
be just as polymorphic as int-processing code.
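
Spelled out, that's just (a sketch; note the Python 2-era long
literals):

    def gcd(a, b):
        # Euclid's algorithm.  The int constant 0 compares and
        # mixes with plain ints and long ints alike, so the same
        # code is polymorphic over both.
        while b != 0:
            a, b = b, a % b
        return a

    gcd(12, 18)        # plain ints
    gcd(12L, 18L)      # long ints -- no change to the function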

--Guido van Rossum (home page: http://www.python.org/~guido/)