[I18n-sig] Re: [Python-Dev] Unicode debate

Paul Prescod paul@prescod.net
Mon, 01 May 2000 15:38:29 -0500

Uche asked for a summary so I cc:ed the xml-sig.

Guido van Rossum wrote:
> ...
> OK.  I really meant recoding in UTF-8 -- I maintain that there are
> lots of forces that prevent recoding most ISO-2022-JP documents in
> UTF-8.

Absolutely agree.

> Are you sure you understand what we are arguing about?

Here's what I thought we were arguing about:

If you put a bunch of "funny characters" into a Python string literal,
and then compare that string literal against a Unicode object, should
those funny characters be treated as logical units of text (characters)
or as bytes? And if bytes, should some transformation be automatically
performed to have those bytes be reinterpreted as characters according
to some particular encoding scheme (probably UTF-8)?
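
To make the two readings concrete, here is a rough sketch in present-day
Python 3 syntax (an assumption on my part; the interpreter under
discussion predates it). The sample word and the variable names are
purely illustrative. It shows that the same bytes compare equal or
unequal to a Unicode object depending on which interpretation is chosen:

```python
# Hypothetical illustration: bytes that a "funny character" literal might hold.
raw = b"caf\xc3\xa9"      # the UTF-8 byte sequence for "café"
text = "caf\u00e9"        # the Unicode object it is compared against

# Reading 1: every byte is itself a character (byte N -> code point N).
as_chars = raw.decode("latin-1")

# Reading 2: the bytes are reinterpreted as UTF-8 before the comparison.
as_utf8 = raw.decode("utf-8")

print(as_chars == text)              # False: 5 characters, not 4
print(as_utf8 == text)               # True
print(len(as_chars), len(as_utf8))   # 5 4
```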

I claim that we should *as far as possible* treat strings as character
lists and not add any new functionality that depends on them being byte
lists. Ideally, we could add a byte array type and start deprecating the
use of strings in that manner. Yes, it will take a long time to fix this
bug but that's what happens when good software lives a long time and the
world changes around it.

> Earlier, you quoted some reference documentation that defines 8-bit
> strings as containing characters.  That's taken out of context -- this
> was written in a time when there was (for most people anyway) no
> difference between characters and bytes, and I really meant bytes.

Actually, I think that that was Fredrik. 

Anyhow, you wrote the documentation that way because it was the most
intuitive way of thinking about strings. It remains the most intuitive
way. I think that that was the point Fredrik was trying to make.

We can't make "byte-list" strings go away soon but we can start moving
people towards the "character-list" model. In concrete terms I would
suggest that old-fashioned strings be automatically coerced to Unicode by
interpreting each byte as a Unicode character. Trying to go the other
way could cause the moral equivalent of an OverflowError but that's not
a problem. 

>>> a=1000000000000000000000000000000000000L
>>> int(a)
Traceback (innermost last):
  File "<stdin>", line 1, in ?
OverflowError: long int too long to convert

And just as with ints and longs, we would expect to eventually unify
strings and unicode strings (but not byte arrays).
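
The coercion rule suggested above can be sketched roughly as follows,
again in present-day Python 3 syntax (an assumption; nothing here is an
existing or proposed API). The helper names widen and narrow are
hypothetical. Interpreting each byte as the character with the same
code point is what the "latin-1" codec happens to do, so it stands in
for the automatic coercion:

```python
def widen(data: bytes) -> str:
    """Byte-list string -> character string: byte N becomes code point N."""
    return data.decode("latin-1")   # always succeeds; all 256 byte values map

def narrow(text: str) -> bytes:
    """Character string -> byte-list string; fails for code points above 255."""
    return text.encode("latin-1")   # raises UnicodeEncodeError when out of range

print(widen(b"\xe9") == "\u00e9")   # True
print(narrow("abc"))                # b'abc'
try:
    narrow("\u20ac")                # EURO SIGN, code point 0x20AC
except UnicodeEncodeError as exc:
    # The string analogue of the OverflowError shown above.
    print("cannot narrow:", exc.reason)
```

Widening never fails, just as int-to-long never does; only the
narrowing direction can raise.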

 Paul Prescod  - ISOGEN Consulting Engineer speaking for himself
It's difficult to extract sense from strings, but they're the only
communication coin we can count on. 
	- http://www.cs.yale.edu/~perlis-alan/quotes.html