Are you sure you understand what we are arguing about?
Here's what I thought we were arguing about:
If you put a bunch of "funny characters" into a Python string literal and then compare that literal against a Unicode object, should those funny characters be treated as logical units of text (characters) or as bytes? And if as bytes, should some transformation automatically be applied so that those bytes are reinterpreted as characters according to some particular encoding scheme (probably UTF-8)?
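In today's terms (this exchange predates the bytes/str split, so take this as an illustrative sketch, not what Python did at the time), the ambiguity looks like this:

```python
# The byte sequence 0xC3 0xA9 is the UTF-8 encoding of 'é',
# but read byte-by-byte as Latin-1 it is the two characters 'Ã©'.
data = b"\xc3\xa9"

as_utf8 = data.decode("utf-8")      # treat the bytes as UTF-8 text
as_latin1 = data.decode("latin-1")  # treat each byte as one character

print(as_utf8)    # 'é'  -- one character
print(as_latin1)  # 'Ã©' -- two characters
print(as_utf8 == as_latin1)  # False: the chosen encoding changes the answer
```

The same bytes compare equal or unequal to a given Unicode string depending entirely on which interpretation you pick, which is exactly the dispute.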
I claim that we should *as far as possible* treat strings as character lists and not add any new functionality that depends on them being byte lists. Ideally, we could add a byte array type and start deprecating the use of strings in that manner. Yes, it will take a long time to fix this bug, but that's what happens when good software lives a long time and the world changes around it.
Earlier, you quoted some reference documentation that defines 8-bit strings as containing characters. That's taken out of context -- this was written in a time when there was (for most people anyway) no difference between characters and bytes, and I really meant bytes.
Actually, I think that that was Fredrik.
Yes, I came across the post again later. Sorry.
Anyhow, you wrote the documentation that way because it was the most intuitive way of thinking about strings. It remains the most intuitive way. I think that that was the point Fredrik was trying to make.
I just wish he made the point more eloquently. The eff-bot seems to be in a crunchy mood lately...
We can't make "byte-list" strings go away soon, but we can start moving people towards the "character-list" model. In concrete terms I would suggest that old-fashioned strings be automatically coerced to Unicode by interpreting each byte as a Unicode character. Trying to go the other way could cause the moral equivalent of an OverflowError, but that's not a problem.
Traceback (innermost last):
  File "<stdin>", line 1, in ?
OverflowError: long int too long to convert
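The proposed coercion, mapping each byte to the code point with the same number, is precisely what Latin-1 decoding does, and the failure in the other direction is real: not every character fits in one byte. A modern-Python sketch of both halves:

```python
# Byte -> code point of the same number is exactly Latin-1 decoding.
raw = bytes([0x48, 0x69, 0xE9])   # 'H', 'i', and the single byte 0xE9
text = raw.decode("latin-1")
print(text)                       # 'Hié' -- each byte became one character

# Going the other way can fail, the "moral equivalent of an OverflowError":
try:
    "Hi\u20ac".encode("latin-1")  # the euro sign has no single-byte value
except UnicodeEncodeError as exc:
    print(exc)
```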
And just as with ints and longs, we would expect to eventually unify strings and unicode strings (but not byte arrays).
OK, you've made your claim -- like Fredrik, you want to interpret 8-bit strings as Latin-1 when converting (not just comparing!) them to Unicode.
I don't think I've heard a good *argument* for this rule though. "A character is a character is a character" sounds like an axiom to me -- something you can't prove or disprove rationally.
I have a bunch of good reasons (I think) for liking UTF-8: it allows you to convert between Unicode and 8-bit strings without losses, Tcl uses it (so displaying Unicode in Tkinter *just* *works*...), it is not Western-language-centric.
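The lossless round-trip property is easy to check in modern Python (a sketch; the sample strings are arbitrary):

```python
# UTF-8 can encode every Unicode character, so text -> bytes -> text
# round-trips without loss, whatever scripts the text mixes.
samples = ["plain ASCII", "café", "\u0416\u65e5\u672c\u8a9e", "\U0001F40D"]

for text in samples:
    encoded = text.encode("utf-8")          # Unicode -> 8-bit string
    assert encoded.decode("utf-8") == text  # ...and back, losslessly
print("all round-trips are lossless")
```

Latin-1 cannot make the same claim in the encode direction: it covers only the first 256 code points.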
Another reason: while you may claim that your (and /F's, and Just's) preferred solution doesn't enter into the encodings issue, I claim it does: Latin-1 is just as much an encoding as any other one.
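That Latin-1 really is "just as much an encoding" shows up in the codec machinery itself (a modern-Python sketch): it goes through the same lookup as every other codec, and its table is simply the identity on 0..255.

```python
import codecs

# Latin-1 is registered and looked up like any other encoding.
codec = codecs.lookup("latin-1")
print(codec.name)  # the codec's canonical name

# Its "table" maps byte n to code point n, for all 256 bytes.
all_bytes = bytes(range(256))
assert all_bytes.decode("latin-1") == "".join(chr(n) for n in range(256))
```

Choosing it as the implicit conversion is therefore still choosing a default encoding; it just happens to be the one whose table is trivial.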
I claim that as long as we're using an encoding we might as well use the most accepted 8-bit encoding of Unicode as the default encoding.
I also think that the issue is blown out of proportion: this ONLY happens when you use Unicode objects, and it ONLY matters when some other part of the program uses 8-bit string objects containing non-ASCII characters. Given the long tradition of using different encodings in 8-bit strings, at that point it is anybody's guess what encoding is used, and UTF-8 is a better guess than Latin-1.
--Guido van Rossum (home page: http://www.python.org/%7Eguido/)