Guido van Rossum writes:

 > On Tue, Aug 30, 2011 at 7:55 PM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
 > > For starters, one that doesn't ever return lone surrogates, but rather
 > > interprets surrogate pairs as Unicode code points as in UTF-16.  (This
 > > is not a Unicode standard definition, it's intended to be suggestive
 > > of why many app writers will be distressed if they must use Python
 > > unicode/str in a narrow build without a fairly comprehensive library
 > > that wraps the arrays in operations that treat unicode/str as an array
 > > of code points.)
 >
 > That sounds like a contradiction -- it wouldn't be a UTF-16 array if
 > you couldn't tell that it was using UTF-16.

Well, that's why I wrote "intended to be suggestive".  The Unicode Standard does not specify at all what the internal representation of characters may be; it only specifies what their external behavior must be when two processes communicate.  (For "process" as used in the standard, think "Python modules" here, since we are concerned with the problems of folks who develop in Python.)  When observing the behavior of a Unicode process, there are no UTF-16 arrays or UTF-8 arrays or even UTF-32 arrays; there are only arrays of characters.

Thus, according to the rules for handling a UTF-16 stream, it is an error to observe a lone surrogate or a surrogate pair that isn't a high-low pair (Unicode 6.0, Ch. 3 "Conformance", requirements C1 and C8-C10).  That's what I mean by "can't tell it's UTF-16".  And I understand those requirements to mean that operations on UTF-16 streams should produce UTF-16 streams, or raise an error.

Without that closure property for basic operations on str, I think it's a bad idea to say that the representation of text in a str in a pre-PEP-393 "narrow" build is UTF-16.  For many users and app developers, it creates expectations that are not fulfilled.
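A small illustration of the closure property, runnable on a modern (wide/PEP-393) CPython: the stdlib utf-16 codecs enforce exactly the "produce valid UTF-16 or raise an error" behavior described above, whereas indexing on a pre-PEP-393 narrow build did not.

```python
# A lone surrogate is not valid UTF-16: the codec refuses to emit it,
# which is the "produce UTF-16 or raise an error" behavior that C1 and
# C8-C10 require of a conforming process.
try:
    "\ud800".encode("utf-16-le")
    raised = False
except UnicodeEncodeError:
    raised = True
assert raised

# A proper high-low surrogate pair, by contrast, decodes to a single
# code point outside the BMP.
pair = b"\x00\xd8\x00\xdc"   # code units U+D800 U+DC00 in UTF-16-LE
assert pair.decode("utf-16-le") == "\U00010000"

# On a narrow build, len("\U00010000") was 2 and s[0] was a lone
# surrogate; the two-code-unit representation is still visible here:
assert len("\U00010000".encode("utf-16-le")) // 2 == 2
```

On a narrow build the same lone surrogate could be produced by ordinary indexing or slicing of a str, with no error raised, which is the gap between the two senses of "UTF-16" discussed below.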
It's true that in common usage an array of code units that usually conforms to UTF-16 may be called "UTF-16" without the closure properties.  I just disagree with that usage, because there are two camps that interpret "UTF-16" differently.  One side says, "we have an array representation in UTF-16 that can handle all Unicode code points efficiently, and if you think you need more, think again," while the other says, "it's too painful to have to check every result for valid UTF-16, and we need a UTF-16 type that supports the usual array operations on *characters* via the usual operators; if you think otherwise, think again."

Note that despite the (presumed) resolution of the UTF-16 issue for CPython by PEP 393, at some point a very similar discussion will take place over "characters" anyway, because users and app developers are going to want a type that handles composition sequences and/or grapheme clusters for them, as well as comparison that respects canonical equivalence, even if it is inefficient compared to str.  That's why I insisted on the phrase "array of code points" to describe the PEP 393 str type, rather than "array of characters".
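To make the "characters" point concrete, here is a stdlib-only sketch of why an array of code points still falls short of what users expect from "characters": two canonically equivalent strings compare unequal until they are normalized.

```python
import unicodedata

# "é" as one precomposed code point vs. "e" plus a combining acute accent.
composed = "\u00e9"      # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # "e" + U+0301 COMBINING ACUTE ACCENT

# As arrays of code points the two strings differ in both content and length...
assert composed != decomposed
assert (len(composed), len(decomposed)) == (1, 2)

# ...even though they are canonically equivalent.  A str comparison that
# respects canonical equivalence has to normalize both sides first.
assert unicodedata.normalize("NFC", decomposed) == composed
assert unicodedata.normalize("NFD", composed) == decomposed
```

A hypothetical "character" type of the kind app developers will ask for would do this normalization (and grapheme-cluster segmentation) internally, at some cost in speed relative to plain str.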