Sorry for the long message. Of course you need only respond to that which is interesting to you. I don't think that most of it is redundant.
Guido van Rossum wrote:
> OK, you've made your claim -- like Fredrik, you want to interpret 8-bit strings as Latin-1 when converting (not just comparing!) them to Unicode.
If the user provides an explicit conversion function (e.g. UTF-8-decode) then of course we should use that function. Under my "a character is a character is a character" model, this "conversion" is morally equivalent to ROT-13, strupr or some other text->text translation. So you could apply UTF-8-decode even to a Unicode string, as long as each character in the string has ord()<256 (so that it could be interpreted as the character representation of a byte).
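To make the explicit case concrete, here is a sketch of an explicit conversion function applied on request. (This uses today's bytes/str spellings as a stand-in for the proposed byte-array and string types; the point is that the decode only happens because somebody asked for it.)

```python
# Hedged sketch: an explicit conversion function applied to a byte array.
raw = b"\xc2\xa4"              # two bytes: the UTF-8 encoding of U+00A4
text = raw.decode("utf-8")     # explicit UTF-8-decode, applied on request
assert len(text) == 1          # one character came out of two bytes
assert ord(text) == 0xA4
```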
> I don't think I've heard a good *argument* for this rule though. "A character is a character is a character" sounds like an axiom to me -- something you can't prove or disprove rationally.
I don't see it as an axiom, but rather as a design decision you make to keep your language simple. Along the lines of "all values are objects" and (now) all integer values are representable with a single type. Are you happy with this?
a = "\244"
b = u"\244"
assert len(a) == len(b)
assert ord(a) == ord(b)
# same thing, right?
print b == a
# Traceback (most recent call last):
#   File "<stdin>", line 1, in ?
# UnicodeError: UTF-8 decoding error: unexpected code byte
If I type "\244" it means I want character 244, not the first half of a UTF-8 escape sequence. "\244" is a string with one character. It has no encoding. It is not Latin-1. It is not UTF-8. It is a string with one character and should compare equal to another string containing the same character.
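Under the one-string model, the comparison above is trivially what you'd expect. A sketch of the proposed semantics (written with today's escape spellings; the u-prefix distinction is assumed away):

```python
# Hedged sketch of the one-string-type semantics.
a = "\244"        # octal escape: character number 0o244 == 0xA4
b = "\u00a4"      # the same character, written as a Unicode escape
assert a == b                  # one character, one identity, so equal
assert len(a) == len(b) == 1
assert ord(a) == ord(b) == 0xA4
```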
I would laugh my ass off if I were using Perl and it did something weird like this to me (as long as it didn't take a month to track down the bug!). Now it isn't so funny.
> I have a bunch of good reasons (I think) for liking UTF-8:
I'm not against UTF-8. It could be an internal representation for some Unicode objects.
> it allows you to convert between Unicode and 8-bit strings without losses,
Here's the heart of our disagreement:
****** I don't want, in Py3K, to think about "converting between Unicode and 8-bit strings." I want strings and I want byte arrays, and I want to worry about converting between *them*. There should be only one string type; its characters should all live in the Unicode character repertoire and its character numbers should all come from Unicode. "Special" characters can be assigned to the Unicode Private Use Area. Byte arrays would be entirely separate and would be converted to Unicode strings with explicit conversion functions. ******
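The whole round trip under that model is explicit in both directions. A sketch (again using bytes/str as stand-ins for the byte-array and string types):

```python
# Hedged sketch: byte array <-> string, always via explicit functions.
raw = bytes([0xE2, 0x82, 0xAC])    # three bytes
s = raw.decode("utf-8")            # explicit conversion in
assert s == "\u20ac"               # one character (EURO SIGN)
assert len(s) == 1
assert s.encode("utf-8") == raw    # explicit conversion out
```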
In the meantime I'm just trying to get other people thinking in this mode so that the transition is easier. If I see people embedding UTF-8 escape sequences in literal strings today, I'm going to hit them.
I recognize that we can't design the universe right now but we could agree on this direction and use it to guide our decision-making.
By the way, if we DID think of 8-bit strings as essentially "byte arrays" then let's use that terminology and imagine some future documentation:
"Python's string type is equivalent to a list of bytes. For clarity, we will call this type a byte list from now on. In contexts where a Unicode character string is desired, Python automatically converts byte lists to character strings by doing a UTF-8 decode on them."
What would you think if Java had a default (I say "magical") conversion from byte arrays to character strings?
The only reason we are discussing this is because Python strings have a dual personality which was useful in the past but will (IMHO, of course) become increasingly confusing in the future. We want the best of both worlds without confusing anybody and I don't think that we can have it.
If you want 8-bit strings to be really byte arrays in perpetuity then let's be consistent in that view. We can compare them to Unicode as we would two completely separate types. "U" comes after "S" so unicode strings always compare greater than 8-bit strings. The use of the word "string" for both objects can be considered just a historical accident.
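A sketch of that consistent view, with the order-by-type-name rule spelled out. (The three-way compare function is my hypothetical spelling of it; with today's type names, "bytes" sorts before "str" just as "S" sorts before "U".)

```python
def compare(a, b):
    # Hypothetical rule from the text: unrelated types order by type
    # name, so every byte array sorts before every string; within one
    # type, compare values as usual. Returns -1, 0, or 1.
    ta, tb = type(a).__name__, type(b).__name__
    if ta != tb:
        return -1 if ta < tb else 1
    return (a > b) - (a < b)

assert compare(b"zzz", "aaa") == -1   # type decides; values are irrelevant
assert compare("aaa", b"zzz") == 1
assert compare("abc", "abd") == -1    # same type: ordinary comparison
```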
> Tcl uses it (so displaying Unicode in Tkinter *just* *works*...),
I don't entirely follow this. Shouldn't the next version of Tkinter accept and return Unicode strings? It would be rather ugly for two Unicode-aware systems (Python and Tk) to talk to each other in 8-bit strings. I mean, I don't care what you do at the C level, but at the Python level arguments should be "just strings."
Consider that len() on the Tkinter side would return a different value than on the Python side.
What about integral indexes into buffers? I'm totally ignorant about Tkinter, but let me ask: wouldn't Tkinter say (e.g.) that the cursor is between the 5th and 6th characters, while in an 8-bit string the equivalent index might be the 11th or 12th byte?
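The index mismatch is easy to demonstrate; the counts are the point, not Tkinter itself:

```python
# Character positions and UTF-8 byte positions diverge for non-ASCII text.
s = "na\u00efvet\u00e9"                 # naïveté
assert len(s) == 7                      # 7 characters (cursor positions)
assert len(s.encode("utf-8")) == 9      # 9 bytes in the UTF-8 form
```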
> it is not Western-language-centric.
If you look at encoding efficiency, it is.
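A quick measurement of that bias (hedged sketch; the example strings are my own):

```python
# UTF-8 favors Latin text: 1 byte per ASCII character, 3 per CJK character.
ascii_text = "hello"
cjk_text = "\u65e5\u672c\u8a9e"                  # three CJK characters
assert len(ascii_text.encode("utf-8")) == 5      # 1 byte each
assert len(cjk_text.encode("utf-8")) == 9        # 3 bytes each
```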
> Another reason: while you may claim that your (and /F's, and Just's) preferred solution doesn't enter into the encodings issue, I claim it does: Latin-1 is just as much an encoding as any other one.
The fact that my proposal has the same effect as making Latin-1 the "default encoding" is a near-term side effect of the definition of Unicode. My long term proposal is to do away with the concept of 8-bit strings (and thus, conversions from 8-bit to Unicode) altogether. One string to rule them all!
Is Unicode going to be the canonical Py3K character set or will we have different objects for different character sets/encodings with different default (I say "magical") conversions between them. Such a design would not be entirely insane though it would be a PITA to implement and maintain. If we aren't ready to establish Unicode as the one true character set then we should probably make no special concessions for Unicode at all. Let a thousand string objects bloom!
Even if we agreed to allow many string objects, byte==character should not be the default string object. Unicode should be the default.
> I also think that the issue is blown out of proportions: this ONLY happens when you use Unicode objects, and it ONLY matters when some other part of the program uses 8-bit string objects containing non-ASCII characters.
Won't this be totally common? Most people are going to use 8-bit literals in their program text but work with Unicode data from XML parsers, COM, WebDAV, Tkinter, etc.?
> Given the long tradition of using different encodings in 8-bit strings, at that point it is anybody's guess what encoding is used, and UTF-8 is a better guess than Latin-1.
If we are guessing then we are doing something wrong. My answer to the question of "default encoding" falls out naturally from a certain way of looking at text, popularized in various other languages and increasingly "the norm" on the Web. If you accept the model (a character is a character is a character), the right behavior is obvious.
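The hazard of guessing is concrete: the same bytes decode to different text under different guesses (sketch):

```python
# One byte sequence, two guesses, two different strings.
raw = b"\xc3\xa9"
assert raw.decode("utf-8") == "\u00e9"            # one character: e-acute
assert raw.decode("latin-1") == "\u00c3\u00a9"    # two characters: mojibake
```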
Nobody is ever going to have trouble understanding how this works. Choose simplicity!