On 8/30/2011 11:03 PM, Stephen J. Turnbull wrote:
Guido van Rossum writes:
 > On Tue, Aug 30, 2011 at 7:55 PM, Stephen J. Turnbull <stephen@xemacs.org> wrote:

 > > For starters, one that doesn't ever return lone surrogates, but rather
 > > interprets surrogate pairs as Unicode code points as in UTF-16.  (This
 > > is not a Unicode standard definition, it's intended to be suggestive
 > > of why many app writers will be distressed if they must use Python
 > > unicode/str in a narrow build without a fairly comprehensive library
 > > that wraps the arrays in operations that treat unicode/str as an array
 > > of code points.)
 > 
 > That sounds like a contradiction -- it wouldn't be a UTF-16 array if
 > you couldn't tell that it was using UTF-16.

Well, that's why I wrote "intended to be suggestive".  The Unicode
Standard does not specify at all what the internal representation of
characters may be, it only specifies what their external behavior must
be when two processes communicate.  (For "process" as used in the
standard, think "Python modules" here, since we are concerned with the
problems of folks who develop in Python.)  When observing the behavior
of a Unicode process, there are no UTF-16 arrays or UTF-8 arrays or
even UTF-32 arrays; only arrays of characters.

Thus, according to the rules of handling a UTF-16 stream, it is an
error to observe a lone surrogate or a surrogate pair that isn't a
high-low pair (Unicode 6.0, Ch. 3 "Conformance", requirements C1 and
C8-C10).  That's what I mean by "can't tell it's UTF-16".  And I
understand those requirements to mean that operations on UTF-16
streams should produce UTF-16 streams, or raise an error.  Without
that closure property for basic operations on str, I think it's a bad
idea to say that the representation of text in a str in a pre-PEP-393
"narrow" build is UTF-16.  For many users and app developers, it
creates expectations that are not fulfilled.

It's true that common usage is that an array of code units that
usually conforms to UTF-16 may be called "UTF-16" without the closure
properties.  I just disagree with that usage, because there are two
camps that interpret "UTF-16" differently.  One side says, "we have an
array representation in UTF-16 that can handle all Unicode code points
efficiently, and if you think you need more, think again", while the
other says "it's too painful to have to check every result for valid
UTF-16, and we need a UTF-16 type that supports the usual array
operations on *characters* via the usual operators; if you think
otherwise, think again."

Note that despite the (presumed) resolution of the UTF-16 issue for
CPython by PEP 393, at some point a very similar discussion will take
place over "characters" anyway, because users and app developers are
going to want a type that handles composition sequences and/or
grapheme clusters for them, as well as comparison that respects
canonical equivalence, even if it is inefficient compared to str.
That's why I insisted on use of "array of code points" to describe the
PEP 393 str type, rather than "array of characters".

On topic:

So from reading all this discussion, I think this point is rather a key one, and it has been made repeatedly in different ways: arrays are not suitable for manipulating Unicode character sequences, and the str type is an array with a veneer of text-manipulation operations, which do not, and cannot, by themselves, efficiently implement Unicode character sequences.
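
A tiny illustration of the mismatch (the results shown assume a build where each str element is a codepoint, i.e. a wide or post-PEP-393 build):

    # "é" written as a composition sequence: a base character plus
    # U+0301 COMBINING ACUTE ACCENT -- one user-perceived character,
    # but two elements of the str array.
    s = "e\u0301"
    print(len(s))      # 2, not 1
    print(s[:1])       # 'e' -- slicing happily splits the cluster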

Python wants to, should, and can implement UTF-16 streams, UTF-8 streams, and UTF-32 streams.  It should, and can, implement streams using other encodings as well, and also binary streams.

Python wants to, should, and can implement 8-bit, 16-bit, 32-bit, and 64-bit arrays.  These are efficient to access, index, and slice.

Python implements a veneer, called str, on some 8-bit, 16-bit, and 32-bit arrays (this will be more true post-PEP 393, although it is true with caveats presently), which interprets array elements as code units (currently) or codepoints (post-PEP), and implements operations that are interesting for text processing, with caveats.
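
The code-unit vs. codepoint distinction is easy to demonstrate with a character outside the BMP (the narrow-build results in the comments are from a pre-PEP-393 UCS-2 build):

    s = "\U00010400"      # DESERET CAPITAL LETTER LONG I, outside the BMP
    # narrow (pre-PEP 393) build:  len(s) == 2, s[0] == '\ud801' (a surrogate)
    # wide or post-PEP 393 build:  len(s) == 1, s[0] == s
    print(len(s), repr(s[0]))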

There is presently no support for arrays of Unicode grapheme clusters or composed characters.  The Python type called str may or may not be properly documented, to the extent that there is confusion between the actual contents of the elements of the type and the concept of character as defined by Unicode.  From comments Guido has made, he is not interested in changing the efficiency or access methods of the str type to raise its level of Unicode support to composed characters or grapheme clusters.  The str type itself can presently be used to process other character encodings: if they use fixed-width elements of fewer than 32 bits, those encodings might be considered Unicode encodings, but there is no requirement that they are, and some operations on str may operate with knowledge of some Unicode semantics, so there are caveats.
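
For example, str comparison today is strictly codepoint-by-codepoint; canonical equivalence is only available by normalizing explicitly first (using the stdlib unicodedata module):

    import unicodedata

    precomposed = "\u00e9"     # é as one codepoint
    decomposed = "e\u0301"     # é as a composition sequence
    print(precomposed == decomposed)     # False: plain codepoint comparison
    print(unicodedata.normalize("NFC", decomposed) == precomposed)   # True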

So it seems that any semantics in support of composed characters, grapheme clusters, or codepoints-stored-as-<32-bit-code-units must be created as either an add-on Python package (in Python) or a C extension, or a combination.  It could be based on extensions to the existing str type, or it could be based on the array type, or it could be based on the bytes type.  It could use an internal format of 32-bit codepoints, PEP 393 variable-size codepoints, or 8- or 16-bit code units.
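
As a very rough sketch of what an add-on layered over str might look like (naive_clusters is my own illustrative name; a real package would follow the UAX #29 segmentation rules, which this does not):

    import unicodedata

    def naive_clusters(s):
        # crude approximation of grapheme clusters: attach combining marks
        # (canonical combining class != 0) to the preceding base character
        cluster = ""
        for ch in s:
            if cluster and unicodedata.combining(ch):
                cluster += ch
            else:
                if cluster:
                    yield cluster
                cluster = ch
        if cluster:
            yield cluster

    print(list(naive_clusters("e\u0301galite\u0301")))   # 7 clusters, not 9 codepoints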

In addition to the expected stream operations and the character length, indexing, and slicing operations, more complex operations would be expected on Unicode string values: regular expressions, comparisons, collations, case-shifting, and perhaps more.  RTL and LTR awareness would add complexity to all operations, or at least require variants of all operations.
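
Collation is a good example of an operation that can't be layered on plain codepoint comparison; today it has to go through the locale machinery (a sketch; the actual ordering depends on the platform's locale support):

    import locale

    locale.setlocale(locale.LC_COLLATE, "")    # the user's locale, if available
    words = ["cote", "cot\u00e9", "c\u00f4te", "c\u00f4t\u00e9"]
    print(sorted(words))                       # codepoint order
    print(sorted(words, key=locale.strxfrm))   # locale-aware collation order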

The questions are:

1) Is anyone interested in writing a PEP for such a thing?
2) Is anyone interested in writing an implementation for such a thing?
3) How many conflicting opinions and arguments will be spawned, making the poor person or persons above lose interest?

Brainstorming ideas (which may wander off-topic in some regards, but were all inspired by this discussion):

BI-0: Tom's analysis makes me think that the UTF-8 encoding, since on average it is the most compact across languages, together with an implementation based on a foundation type of bytes or 'B' arrays plus side indexes of some sort, could be an appropriate starting point.  UTF-8 is variable length, but so are composed characters and grapheme clusters.  Building an array, each of whose units could hold the largest grapheme cluster, would seem extremely inefficient, just like 32-bit Unicode is extremely inefficient for dealing with ASCII, so variable-length units seem to be an imperative part of a solution.  At least until one thinks up BI-2.
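
A minimal sketch of the side-index idea (UTF8String and _STRIDE are purely illustrative names; a real implementation would live in C and would index grapheme clusters as well as codepoints):

    class UTF8String:
        _STRIDE = 32    # remember the byte offset of every 32nd codepoint

        def __init__(self, text):
            self._data = text.encode("utf-8")
            self._index = []            # byte offsets of codepoints 0, 32, 64, ...
            offset = 0
            for i, ch in enumerate(text):
                if i % self._STRIDE == 0:
                    self._index.append(offset)
                offset += len(ch.encode("utf-8"))
            self._length = len(text)

        def __len__(self):
            return self._length

        def __getitem__(self, i):
            if not 0 <= i < self._length:
                raise IndexError(i)
            # hop to the nearest indexed codepoint, then decode just enough
            # bytes to reach the target (each codepoint is at most 4 bytes,
            # so a truncated trailing sequence can safely be ignored)
            start = self._index[i // self._STRIDE]
            chunk = self._data[start:start + 4 * self._STRIDE]
            return chunk.decode("utf-8", "ignore")[i % self._STRIDE]

    s = UTF8String("na\u00efvet\u00e9")
    print(len(s), s[2], s[6])    # 7 ï é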

BI-1: Perhaps a 32-bit base, with the upper 11 bits used to cache character properties from various character-attribute database lookups, could be an effective alternative, but it wouldn't eliminate the need for dealing with variable-length units for length, indexing, and slicing operations.
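
Something like this, perhaps (the flag-bit assignments and the pack() helper are arbitrary, just to show the idea; Unicode itself needs only 21 of the 32 bits):

    import unicodedata
    from array import array

    FLAG_COMBINING = 1 << 21    # cached: canonical combining class != 0
    FLAG_UPPER     = 1 << 22    # cached: str.isupper()

    def pack(text):
        elems = array("I")      # assumes C unsigned int is 32 bits wide
        for ch in text:
            value = ord(ch)
            if unicodedata.combining(ch):
                value |= FLAG_COMBINING
            if ch.isupper():
                value |= FLAG_UPPER
            elems.append(value)
        return elems

    packed = pack("E\u0301cole")
    # the cached bit answers "is this a combining mark?" with no further lookup
    print(bool(packed[1] & FLAG_COMBINING), hex(packed[0]))    # True 0x400045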

BI-2: Maybe a 32-bit base would be useful so that one high bit could be used to flag that this character position actually holds an index to a multi-codepoint character, and the index would then hold the actual codes for that character.  This would allow for at most 2^31 (and memory limited) different multi-codepoint characters in a string (or perhaps per application, if the multi-codepoint characters are shared between strings), but would suddenly allow array indexing of grapheme clusters and composed characters... with double-indexing required for multi-codepoint character access. [This idea seems similar to one that was mentioned elsewhere in this thread, suggesting that private use characters could be used to represent multi-codepoint characters, but (a) doesn't infringe on private uses, and (b) allows for a greater number of multi-codepoint characters to be used.] 
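
A sketch of BI-2's double indexing (ClusterString and HIGH_BIT are my names; segmentation into clusters is assumed to have happened elsewhere):

    from array import array

    HIGH_BIT = 1 << 31    # set: the slot is an index into the cluster table

    class ClusterString:
        def __init__(self, clusters):    # clusters: pre-segmented list of str
            self._slots = array("I")     # assumes 32-bit C unsigned int
            self._table = []             # the multi-codepoint "characters"
            for cluster in clusters:
                if len(cluster) == 1:
                    self._slots.append(ord(cluster))
                else:
                    self._slots.append(HIGH_BIT | len(self._table))
                    self._table.append(cluster)

        def __len__(self):
            return len(self._slots)      # O(1) length in clusters

        def __getitem__(self, i):
            slot = self._slots[i]
            if slot & HIGH_BIT:
                return self._table[slot ^ HIGH_BIT]    # the double index
            return chr(slot)

    s = ClusterString(["e\u0301", "t", "e\u0301"])
    print(len(s), repr(s[0]))    # 3 'é' (a two-codepoint cluster)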

BI-3: Both BI-1 and BI-2 could also be built on top of the PEP 393 str, allowing multi-codepoint-character-supporting applications to benefit from the space efficiencies of PEP 393 when no multi-codepoint characters are fed into the application.

BI-4: Since Unicode has 21-bit codepoints, one wonders if 24-bit array elements might be appropriate, rather than 32-bit.  BI-2 could still operate, with a theoretical reduction to 2^23 possible multi-codepoint characters in an application.  Access would be less efficient, but still O(1), and 25% of the space would be saved.  This idea could be applied to PEP 393 independently of multi-codepoint character support.
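
The 24-bit packing is easy to sketch, even in pure Python (pack24/get24 are illustrative names):

    def pack24(text):
        # three little-endian bytes per codepoint; 21 bits always fit
        buf = bytearray()
        for ch in text:
            cp = ord(ch)
            buf += bytes((cp & 0xFF, (cp >> 8) & 0xFF, cp >> 16))
        return buf

    def get24(buf, i):
        b0, b1, b2 = buf[3 * i:3 * i + 3]    # still O(1), just unaligned
        return chr(b0 | (b1 << 8) | (b2 << 16))

    buf = pack24("a\U00010400z")
    print(len(buf) // 3, get24(buf, 1) == "\U00010400")    # 3 True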

BI-5: I'm pretty sure there are inappropriate or illegal sequences of combining characters, as well as codepoints that should not stand alone; one example is lone surrogates.  Such characters, out of an appropriate sequence, could be flagged with a high bit so that they could be quickly recognized as illegal Unicode, but codecs could be provided to allow them to round-trip, and applications could recognize immediately that they should be handled as "binary gibberish" in an otherwise Unicode stream.  This idea could be applied to PEP 393 independently of additional multi-codepoint character support.
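
Detecting the candidates for that flag is already simple at the codepoint level (a sketch, assuming a wide or post-PEP-393 build; the function name is mine):

    def lone_surrogates(s):
        # surrogate codepoints are ill-formed whenever they appear in a
        # str on their own -- these are the positions BI-5 would flag
        return [i for i, ch in enumerate(s) if 0xD800 <= ord(ch) <= 0xDFFF]

    print(lone_surrogates("ok\udcff"))    # [2]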

BI-6: Maybe another high bit could be used with a different codec error handler, instead of using lone surrogates, when decoding not-quite-conformant byte streams (such as OS filenames).  It's sad we didn't think of this one before doing all the lone-surrogate work.  Of course, this solution wouldn't work on narrow builds, because not even surrogates can represent bits above the Unicode codepoint range!  But once we have PEP 393, we _could_ replace the inappropriate use of lone surrogates with out-of-the-Unicode-codepoint-range integers, without introducing ambiguity in the interpretation of lone surrogates.  This idea could be applied to PEP 393 independently of multi-codepoint character support.
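
For comparison, here is what BI-6 would be replacing: today's "surrogateescape" error handler (PEP 383) smuggles undecodable bytes through as the lone surrogates U+DC80..U+DCFF and restores them on encoding:

    name = b"caf\xe9".decode("utf-8", "surrogateescape")
    print(repr(name))                                # 'caf\udce9'
    print(name.encode("utf-8", "surrogateescape"))   # b'caf\xe9'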

Glenn