Guido van Rossum wrote:
I think that maybe an important point is getting lost here. I could be wrong, but it seems that all of this emphasis on encodings is misplaced.
In practical applications that manipulate text, encodings creep up all the time.
I'm not saying that encodings are unimportant. I'm saying that they are *different* from what Fredrik was talking about. He was talking about a coherent logical model for characters and character strings based on the conventions of more modern languages and systems than C and Python.
How can we make the transition to a "binary goops are not strings" world easiest?
I'm afraid that's a bigger issue than we can solve for Python 1.6.
I understand that we can't fix the problem now. I just think that we shouldn't go out of our way to make it worse.
If we make byte-array strings "magically" cast themselves into character-strings, people will expect that behavior forever.
It doesn't meet the definition of string used in the Unicode spec, nor in XML, nor in Java, nor at the W3C, nor in most other up-and-coming specifications.
OK, so that's a good indication of where you're coming from. Maybe you should spend a little more time in the trenches and a little less in standards bodies. Standards are good, but sometimes disconnected from reality (remember ISO networking? :-).
As far as I know, XML and Java are used a fair bit in the real world...even somewhat in Asia. In fact, there is a book titled "XML and Java" written by three Japanese men.
And this is exactly why encodings will remain important: entities encoded in ISO-2022-JP have no compelling reason to be recoded permanently into ISO10646, and there are lots of forces that make it convenient to keep it encoded in ISO-2022-JP (like existing tools).
You cannot recode an ISO-2022-JP document into ISO10646 because 10646 is a character *set* and not an encoding. ISO-2022-JP says how you should represent characters in terms of bits and bytes. ISO10646 defines a mapping from integers to characters.
They are both important, but separate. I think that this automagical re-encoding conflates them.
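
To make the distinction concrete, here is a rough sketch in present-day Python (not 1.6, and not anything proposed in this thread). The sample text and variable names are mine, and I'm assuming the standard codecs "iso2022_jp", "utf-8" and "utf-16-be" are available, as they are in a stock CPython 3 build. The character string is a sequence of 10646 integers; each encoding is just one way to write those integers as bytes.

```python
# A character string: a sequence of ISO 10646 / Unicode code points.
# The sample text (three Japanese characters) is only an illustration.
text = "\u65e5\u672c\u8a9e"

# The character set's view: one integer per character, no bytes involved.
code_points = [ord(ch) for ch in text]

# Encodings: different byte representations of the same code points.
as_iso2022jp = text.encode("iso2022_jp")
as_utf8 = text.encode("utf-8")
as_utf16 = text.encode("utf-16-be")

# The byte strings differ from one another...
print(code_points)
print(as_iso2022jp, as_utf8, as_utf16, sep="\n")

# ...but each decodes back to the same characters, as long as the
# decoder is told explicitly which encoding the bytes are in.
for raw, encoding in [
    (as_iso2022jp, "iso2022_jp"),
    (as_utf8, "utf-8"),
    (as_utf16, "utf-16-be"),
]:
    assert raw.decode(encoding) == text
```

Nothing in the bytes themselves says which encoding they are in, which is why I would rather the conversion be spelled out than happen by magic.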