Fredrik Lundh wrote:
> ...
> But alright, I give up. I've wasted way too much time on this, my
> patches were rejected, and nobody seems to care. Not exactly inspiring.
I can understand how frustrating this is. Sometimes something seems just so clean and mathematically obvious that you can't see why others don't see it that way.

A character is the "smallest unit of text." Strings are lists of characters. Characters in character sets have numbers. Python users should never know or care whether a string object is an 8-bit string or a Unicode string. There should be no distinction. u"" should be a syntactic shortcut.

The primary reason I have not been involved is that I have not had a chance to look at the implementation and figure out whether there is an overriding implementation-based reason to ignore the obvious right thing (e.g. the right thing would break too much code, or be too slow, or ...).

"Unicode objects" should be an implementation detail (if they exist at all). Strings are strings are strings. The Python programmer shouldn't care whether one string was read from a Unicode file and another from an ASCII file, or whether one was typed in with "u" and one without. It's all the same thing!

If the programmer wants to do an explicit UTF-8 decode on a string (whether it is a Unicode string or an 8-bit string ... no difference), then that decode should proceed by looking at each character, deriving an integer, and then treating that integer as an octet according to the UTF-8 specification:

Char -> Integer -> Byte -> Char

The end result (and hopefully the performance) would be the same, but the model is much, much cleaner if there is only one kind of string. We should not ignore the example set by every other language (and yes, I'm including XML here :) ).

I'm as desperate (if not as vocal) as Fredrik is here.

-- 
Paul Prescod - ISOGEN Consulting Engineer speaking for himself
It's difficult to extract sense from strings, but they're the only
communication coin we can count on.
    - http://www.cs.yale.edu/~perlis-alan/quotes.html
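
A minimal sketch of the Char -> Integer -> Byte -> Char model described
above, written in today's Python (which did eventually settle on a single
string type); the helper name utf8_decode is hypothetical, not an API from
any Python release:

    def utf8_decode(s):
        # Char -> Integer: ord() gives each character's number.
        # Integer -> Byte: treat each number as one octet (bytes()
        # rejects numbers above 255, as the model requires).
        octets = bytes(ord(ch) for ch in s)
        # Byte -> Char: interpret the octets per the UTF-8 specification.
        return octets.decode("utf-8")

    # "\xc3\xa9" is a two-character string whose character numbers,
    # 0xC3 and 0xA9, happen to form the UTF-8 encoding of U+00E9.
    assert utf8_decode("\xc3\xa9") == "\xe9"

    # And in the one-string-type model, the u prefix is purely syntactic:
    assert u"abc" == "abc"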