[I18n-sig] Re: [Python-Dev] Unicode debate

Paul Prescod paul@prescod.net
Mon, 01 May 2000 19:19:20 -0500


Sorry for the long message. Of course you need only respond to that
which is interesting to you. I don't think that most of it is redundant.

Guido van Rossum wrote:
> 
> ...
> 
> OK, you've made your claim -- like Fredrik, you want to interpret
> 8-bit strings as Latin-1 when converting (not just comparing!) them to
> Unicode.

If the user provides an explicit conversion function (e.g. UTF-8-decode),
then of course we should use that function. Under my "a character is a
character is a character" model, this "conversion" is morally equivalent
to ROT-13, strupr or any other text->text translation. So you could
apply UTF-8-decode even to a Unicode string, as long as each character in
the string has ord()<256 (so that it can be interpreted as the character
representation of a byte).
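
Concretely, here is what such a text->text transform might look like.
(A minimal sketch: the function name is mine, and I'm borrowing a
hypothetical bytes() constructor and decode method for the byte-level
step.)

def utf8_decode_text(s):
    # A text -> text transform: view each character as a byte value
    # (only legal when every ord(c) < 256), then UTF-8-decode those bytes.
    if [c for c in s if ord(c) > 255]:
        raise ValueError("character out of range for byte interpretation")
    return bytes(ord(c) for c in s).decode("utf-8")

# The two characters \303\251 happen to be the UTF-8 encoding of e-acute:
assert utf8_decode_text("caf\303\251") == u"caf\351"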

> I don't think I've heard a good *argument* for this rule though.  "A
> character is a character is a character" sounds like an axiom to me --
> something you can't prove or disprove rationally.

I don't see it as an axiom, but rather as a design decision you make to
keep your language simple. Along the lines of "all values are objects"
and (now) all integer values are representable with a single type. Are
you happy with this?

a="\244"
b=u"\244"
assert len(a)==len(b)
assert ord(a[0])==ord(b[0])

# same thing, right?
print b==a
# Traceback (most recent call last):
#  File "<stdin>", line 1, in ?
# UnicodeError: UTF-8 decoding error: unexpected code byte

If I type "\244" it means I want character 244, not the first half of a
UTF-8 escape sequence. "\244" is a string with one character. It has no
encoding. It is not latin-1. It is not UTF-8. It is a string with one
character and should compare as equal with another string with the same
character.

I would laugh my ass off if I were using Perl and it did something this
weird to me (as long as it didn't take a month to track down the bug!).
Now it isn't so funny.

> I have a bunch of good reasons (I think) for liking UTF-8: 

I'm not against UTF-8. It could be an internal representation for some
Unicode objects.

> it allows
> you to convert between Unicode and 8-bit strings without losses, 

Here's the heart of our disagreement:

******
I don't want, in Py3K, to think about "converting between Unicode and
8-bit strings." I want strings and I want byte-arrays and I want to
worry about converting between *them*. There should be only one string
type, its characters should all live in the Unicode character repertoire
and the character numbers should all come from Unicode. "Special"
characters can be assigned to the Unicode Private Use Area. Byte arrays
would be entirely separate and would be converted to Unicode strings
with explicit conversion functions.
*****
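
As a sketch of what I mean (using a hypothetical b"..." spelling for
byte-array literals, with encode/decode methods as the explicit
conversion functions):

data = b"\302\244"             # a byte array: two bytes, zero characters
text = data.decode("utf-8")    # explicit conversion in -> one character, U+00A4
assert len(data) == 2 and len(text) == 1
assert text.encode("utf-8") == data   # explicit conversion back out

No magic, no guessing: nothing crosses the byte/character boundary
without the programmer naming an encoding.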

In the meantime I'm just trying to get other people thinking in this
mode so that the transition is easier. If I see people embedding UTF-8
escape sequences in literal strings today, I'm going to hit them.

I recognize that we can't design the universe right now but we could
agree on this direction and use it to guide our decision-making.

By the way, if we DID think of 8-bit strings as essentially "byte
arrays", then let's use that terminology and imagine some future
documentation:

"Python's string type is equivalent to a list of bytes. For clarity, we
will call this type a byte list from now on. In contexts where a Unicode
character-string is desired, Python automatically converts byte lists to
charcter strings by doing a UTF-8 decode on them." 

What would you think if Java had a default (I say "magical") conversion
from byte arrays to character strings?

The only reason we are discussing this is that Python strings have a
dual personality, which was useful in the past but will (IMHO, of
course) become increasingly confusing in the future. We want the best of
both worlds without confusing anybody, and I don't think that we can
have it.

If you want 8-bit strings to really be byte arrays in perpetuity, then
let's be consistent in that view. We can compare them to Unicode as we
would compare two completely separate types. "U" comes after "S", so
Unicode strings always compare greater than 8-bit strings. The use of
the word "string" for both objects can be considered just a historical
accident.

> Tcl uses it (so displaying Unicode in Tkinter *just* *works*...), 

I don't follow this entirely. Shouldn't the next version of Tkinter
accept and return Unicode strings? It would be rather ugly for two
Unicode-aware systems (Python and Tk) to talk to each other in 8-bit
strings. I don't care what happens at the C level, but at the Python
level arguments should be "just strings."

Consider that len() on the Tkinter side would return a different value
than len() on the Python side.

What about integral indexes into buffers? I'm totally ignorant about
Tkinter, but wouldn't it say (e.g.) that the cursor is between the 5th
and 6th characters while in an 8-bit string the equivalent index might
be the 11th or 12th byte?
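
The mismatch is easy to demonstrate (hypothetical encode method again;
the text and the numbers are just an example):

s = u"caf\351 cr\350me"        # "café crème": 10 characters
assert len(s) == 10
utf8 = s.encode("utf-8")       # each accented letter becomes 2 bytes
assert len(utf8) == 12
# A cursor "between the 5th and 6th characters" sits between
# bytes 6 and 7 of the UTF-8 form.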

> it is not Western-language-centric.

If you look at encoding efficiency, it is.
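
For example (the sizes follow directly from the UTF-8 and UTF-16
definitions; the encode spelling is hypothetical):

ascii_text = u"hello"               # 5 ASCII characters
cjk_text = u"\u65e5\u672c\u8a9e"    # "Japanese" written in Japanese: 3 characters
assert len(ascii_text.encode("utf-8")) == 5   # 1 byte per character
assert len(cjk_text.encode("utf-8")) == 9     # 3 bytes per character
# The same CJK text is 6 bytes in UTF-16, so UTF-8 favors Western scripts.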

> Another reason: while you may claim that your (and /F's, and Just's)
> preferred solution doesn't enter into the encodings issue, I claim it
> does: Latin-1 is just as much an encoding as any other one.

The fact that my proposal has the same effect as making Latin-1 the
"default encoding" is a near-term side effect of the definition of
Unicode: its first 256 code points coincide with Latin-1. My long-term
proposal is to do away with the concept of 8-bit strings (and thus
conversions from 8-bit to Unicode) altogether. One string to rule them
all!

Is Unicode going to be the canonical Py3K character set, or will we have
different objects for different character sets/encodings, with different
default (I say "magical") conversions between them? Such a design would
not be entirely insane, though it would be a PITA to implement and
maintain. If we aren't ready to establish Unicode as the one true
character set, then we should probably make no special concessions for
Unicode at all. Let a thousand string objects bloom!

Even if we agreed to allow many string objects, the byte==character
object should not be the default. Unicode should be the default.

> I also think that the issue is blown out of proportions: this ONLY
> happens when you use Unicode objects, and it ONLY matters when some
> other part of the program uses 8-bit string objects containing
> non-ASCII characters.  

Won't this be totally common? Most people are going to use 8-bit
literals in their program text but work with Unicode data from XML
parsers, COM, WebDAV, Tkinter, and so on.

> Given the long tradition of using different
> encodings in 8-bit strings, at that point it is anybody's guess what
> encoding is used, and UTF-8 is a better guess than Latin-1.

If we are guessing then we are doing something wrong. My answer to the
question of "default encoding" falls out naturally from a certain way of
looking at text, popularized in various other languages and increasingly
"the norm" on the Web. If you accept the model (a character is a
character is a character), the right behavior is obvious. 

"\244"==u"\244"

Nobody is ever going to have trouble understanding how this works.
Choose simplicity!

-- 
 Paul Prescod  - ISOGEN Consulting Engineer speaking for himself
It's difficult to extract sense from strings, but they're the only
communication coin we can count on. 
	- http://www.cs.yale.edu/~perlis-alan/quotes.html