Guido van Rossum wrote, about how to represent strings:
> Paul, we're both just saying the same thing over and over without
> convincing each other. I'll wait till someone who wasn't in this
> debate before chimes in.
I'm with Paul and Fredrik on this one - at least about characters being the
atoms of a string. We **have** to be able to refer to **characters** in a
string, and without guessing. Otherwise, how could you ever construct a
test, like theString==[a particular Japanese ideograph]? If we do it by
having a "string" datatype, which is really a byte list, and a
"unicodeString" datatype which is a list of abstract characters, I'd say
everyone could get used to working with them. We'd have to supply
conversion functions, of course.
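To make the two-type idea concrete, here is a minimal sketch in present-day Python 3, where `bytes` stands in for the byte-list "string" and `str` for the "unicodeString" of abstract characters. It is only an analogy to the proposal, not something settled in this thread; the sample text and the choice of UTF-8 for the conversion functions are my own assumptions.

```python
# Assumption: Python 3, where str holds abstract characters and bytes holds raw bytes.
ideograph = "\u65e5"               # a single Japanese ideograph (U+65E5)
the_string = "\u65e5\u672c\u8a9e"  # three characters

# Characters are the atoms: indexing and comparison need no guessing.
first_is_ideograph = the_string[0] == ideograph
char_count = len(the_string)

# Conversion functions between the two types (UTF-8 is an assumed encoding).
as_bytes = the_string.encode("utf-8")
byte_count = len(as_bytes)
round_trip = as_bytes.decode("utf-8")

# Indexing the byte form yields a byte value, not a character.
first_byte = as_bytes[0]

print(first_is_ideograph, char_count, byte_count, round_trip == the_string)
```

The same text is three atoms in one type and nine in the other, which is why the test against a particular ideograph only works reliably on the character type.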
This route might be the easiest to understand for users. We'd have to be
very clear about what file.read() would return, for example, and all those
similar read and write functions. And we'd have to work out how real 8-bit
calls (like writing to a socket?) would play with the new types.
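One possible answer to the file.read() question, sketched with present-day Python 3 as an illustration only: let the mode the file was opened in decide which type comes back. The temporary file, the sample text, and the UTF-8 encoding are assumptions made for the example.

```python
import os
import tempfile

text = "caf\u00e9"  # 4 characters; the last one is not ASCII

# Create a scratch file so the example is self-contained.
fd, path = tempfile.mkstemp()
os.close(fd)
try:
    # Writing characters requires naming an encoding (UTF-8 assumed here).
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # Binary mode: read() returns raw bytes, as a real 8-bit call would see them.
    with open(path, "rb") as f:
        raw = f.read()

    # Text mode: read() returns characters, decoded with the stated encoding.
    with open(path, "r", encoding="utf-8") as f:
        chars = f.read()
finally:
    os.remove(path)

print(type(raw).__name__, len(raw))
print(type(chars).__name__, len(chars))
```

A socket write would presumably sit on the byte side of this split, so character data would have to go through an explicit conversion first.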
For extra clarity, we could leave string the way it is, introduce stringU
(Unicode string) **and** string8 (Latin-1 or byte list, whichever seems to
be the best equivalent to the current string). Then we would deprecate
string in favor of string8. Then if Tcl and Perl go to Unicode strings we
pass them a stringU, and if they go some other way, we pass them something
else. Come to think of it, we need some data type that will continue
to work with C and C++. Would that be string8 or would we keep string for
that?
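On the "Latin-1 or byte list" choice for string8: the two are close because Latin-1 maps each byte value to the code point with the same number. A small sketch in present-day Python 3 (my choice for illustration, not part of the proposal) shows the round trip is lossless for every 8-bit value, while anything beyond U+00FF has to go to the Unicode type.

```python
# Every possible 8-bit value, as a byte list.
all_bytes = bytes(range(256))

# Latin-1 maps byte value n to code point n, so decoding never fails.
as_chars = all_bytes.decode("latin-1")
same_numbers = [ord(c) for c in as_chars] == list(all_bytes)

# Encoding back recovers the original bytes exactly.
lossless = as_chars.encode("latin-1") == all_bytes

# A character above U+00FF has no Latin-1 form.
try:
    "\u65e5".encode("latin-1")
    fits_in_latin1 = True
except UnicodeEncodeError:
    fits_in_latin1 = False

print(same_numbers, lossless, fits_in_latin1)
```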
Clarity and ease of use for the user should be primary, fast implementations
next. If we didn't care about ease of use and clarity, we could all use
Scheme or C. Let's not lose sight of it.
I'd suggest we create some use cases or scenarios for this area - this
needs input from those who know encodings and low-level Python stuff better
than I. Then we could examine more systematically how well various
approaches would work out.