[XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate

Guido van Rossum guido@python.org
Mon, 01 May 2000 23:31:54 -0400


Tom Passin:
> I'm with Paul and Fredrik on this one - at least about characters being the
> atoms of a string.  We **have** to be able to refer to **characters** in a
> string, and without guessing.  Otherwise, how could you ever construct a
> test, like theString[3]==[a particular Japanese ideograph]?  If we do it by
> having a "string" datatype, which is really a byte list, and a
> "unicodeString" datatype which is a list of abstract characters, I'd say
> everyone could get used to working with them.  We'd have to supply
> conversion functions, of course.

You seem unfamiliar with the details of the implementation we're
proposing?  We already have two datatypes, 8-bit string (call it byte
array) and Unicode string.  There are conversions between them:
explicit conversions such as u.encode("utf-8") or unicode(s,
"latin-1") and implicit conversions used in situations like u+s or
u==s.  The whole discussion is *only* about what the default
conversion in the latter cases should be -- the rest of the
implementation is rock solid and works well.
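
For concreteness, here is a minimal sketch of both kinds of
conversion, in the Python 2 syntax of the proposal (the variable
names are mine):

    s = "caf\xe9"               # 8-bit string holding Latin-1 bytes
    u = unicode(s, "latin-1")   # explicit decode  -> u"caf\xe9"
    b = u.encode("utf-8")       # explicit encode  -> "caf\xc3\xa9"
    both = u + s                # implicit: s is first decoded using
                                # the default encoding under debate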

Users can accomplish what you are proposing by simply ensuring that
theString is a Unicode string.
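
For example (a sketch; the data, the codec, and the particular
ideograph are all hypothetical):

    theString = unicode(rawLine, "iso-2022-jp")  # decode once, up front
    if theString[3] == u"\u65e5":                # character 3 vs. the
        ...                                      # ideograph for 'day'

Indexing a Unicode string always yields one abstract character, never
half of a multi-byte sequence.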

> This route might be the easiest to understand for users.  We'd have to be
> very clear about what file.read() would return, for example, and all those
> similar read and write functions.  And we'd have to work out how real 8-bit
> calls (like writing to a socket?) would play with the new types.

These are all well defined -- they all deal in 8-bit strings
internally, and all use the default conversions when given Unicode
strings.  Programs that only deal in 8-bit strings don't need to
change.  Programs that want to deal with Unicode and sockets, for
example, must know what encoding to use on the socket, and if it's not
the default encoding, must use explicit conversions.
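
For instance, a program that wants UTF-8 on the wire would do
something like this (a sketch; sock is an already-connected socket,
and a real program would buffer recv() data until a complete UTF-8
sequence has arrived):

    sock.send(u.encode("utf-8"))            # explicit encode going out
    u2 = unicode(sock.recv(1024), "utf-8")  # explicit decode coming in

The socket itself only ever sees 8-bit strings.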

> For extra clarity, we could leave string the way it is, introduce stringU
> (unicode string) **and** string8 (Latin-1 or byte list, whichever seems to
> be the best equivalent to the current string).  Then we would deprecate
> string in favor of string8.  Then if Tcl and Perl go to Unicode strings we
> pass them a stringU, and if they go some other way, we pass them something
> else.  Come to think of it, we need some data type that will continue
> to work with C and C++.  Would that be string8 or would we keep string for
> that purpose?

What would be the difference between string and string8?

> Clarity and ease of use for the user should be primary, fast implementations
> next.  If we didn't care about ease of use and clarity, we could all use
> Scheme or C; don't lose sight of it.
> 
> I'd suggest we could create some use cases or scenarios for this area -
> needs input from those who know encodings and low level Python stuff better
> than I.  Then we could examine more systematically how well various
> approaches would work out.

Very good.

Here's one usage scenario.

A Japanese user is reading lines from a file encoded in ISO-2022-JP.
The readline() method returns 8-bit strings in that encoding (the file
object doesn't do any decoding).  She realizes that she wants to do
some character-level processing on the file so she decides to convert
the strings to Unicode.
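
Concretely, her code might look like this (a sketch; the filename is
made up, and it assumes an ISO-2022-JP codec is available -- the
Japanese codecs are distributed separately at the moment):

    f = open("diary.txt")
    line = f.readline()         # 8-bit string of ISO-2022-JP bytes
    while line:
        u = unicode(line, "iso-2022-jp")  # explicit decode
        # ... character-level processing on u ...
        line = f.readline()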

I believe that whether the default encoding is UTF-8 or Latin-1
doesn't matter here -- both are wrong; she needs to write an explicit
unicode(line, "iso-2022-jp") call anyway.  I would argue that UTF-8 is
"better", because interpreting ISO-2022-JP data as UTF-8 will most
likely give an exception (when a \300 range byte isn't followed by a
\200 range byte) -- while interpreting it as Latin-1 will silently do
the wrong thing.  (An explicit error is always better than silent
failure.)
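
To illustrate (a sketch; note that ISO-2022-JP itself is a 7-bit
encoding, so the loud failure shows up most clearly with an 8-bit
Japanese encoding -- the bytes below are EUC-JP for the word
"kanji"):

    junk = "\264\301\273\372"   # EUC-JP bytes; not valid UTF-8
    unicode(junk, "utf-8")      # raises UnicodeError: \264 is a bare
                                # continuation byte
    unicode(junk, "latin-1")    # "succeeds" as u'\xb4\xc1\xbb\xfa' --
                                # four unrelated accented characters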

I'd love to discuss other scenarios.

--Guido van Rossum (home page: http://www.python.org/~guido/)