[Python-Dev] PEP 393 Summer of Code Project
Glenn Linderman
v+python at g.nevcal.com
Wed Aug 31 10:09:25 CEST 2011
On 8/30/2011 11:03 PM, Stephen J. Turnbull wrote:
> Guido van Rossum writes:
> > On Tue, Aug 30, 2011 at 7:55 PM, Stephen J. Turnbull <stephen at xemacs.org> wrote:
>
> > > For starters, one that doesn't ever return lone surrogates, but rather
> > > interprets surrogate pairs as Unicode code points as in UTF-16. (This
> > > is not a Unicode standard definition, it's intended to be suggestive
> > > of why many app writers will be distressed if they must use Python
> > > unicode/str in a narrow build without a fairly comprehensive library
> > > that wraps the arrays in operations that treat unicode/str as an array
> > > of code points.)
> >
> > That sounds like a contradiction -- it wouldn't be a UTF-16 array if
> > you couldn't tell that it was using UTF-16.
>
> Well, that's why I wrote "intended to be suggestive". The Unicode
> Standard does not specify at all what the internal representation of
> characters may be, it only specifies what their external behavior must
> be when two processes communicate. (For "process" as used in the
> standard, think "Python modules" here, since we are concerned with the
> problems of folks who develop in Python.) When observing the behavior
> of a Unicode process, there are no UTF-16 arrays or UTF-8 arrays or
> even UTF-32 arrays; only arrays of characters.
>
> Thus, according to the rules of handling a UTF-16 stream, it is an
> error to observe a lone surrogate or a surrogate pair that isn't a
> high-low pair (Unicode 6.0, Ch. 3 "Conformance", requirements C1 and
> C8-C10). That's what I mean by "can't tell it's UTF-16". And I
> understand those requirements to mean that operations on UTF-16
> streams should produce UTF-16 streams, or raise an error. Without
> that closure property for basic operations on str, I think it's a bad
> idea to say that the representation of text in a str in a pre-PEP-393
> "narrow" build is UTF-16. For many users and app developers, it
> creates expectations that are not fulfilled.
>
> It's true that common usage is that an array of code units that
> usually conforms to UTF-16 may be called "UTF-16" without the closure
> properties. I just disagree with that usage, because there are two
> camps that interpret "UTF-16" differently. One side says, "we have an
> array representation in UTF-16 that can handle all Unicode code points
> efficiently, and if you think you need more, think again", while the
> other says "it's too painful to have to check every result for valid
> UTF-16, and we need a UTF-16 type that supports the usual array
> operations on *characters* via the usual operators; if you think
> otherwise, think again."
>
> Note that despite the (presumed) resolution of the UTF-16 issue for
> CPython by PEP 393, at some point a very similar discussion will take
> place over "characters" anyway, because users and app developers are
> going to want a type that handles composition sequences and/or
> grapheme clusters for them, as well as comparison that respects
> canonical equivalence, even if it is inefficient compared to str.
> That's why I insisted on use of "array of code points" to describe the
> PEP 393 str type, rather than "array of characters".
On topic:
So from reading all this discussion, I think this point is rather a key
one... and it has been made repeatedly in different ways: Arrays are
not suitable for manipulating Unicode character sequences, and the str
type is an array with a veneer of text manipulation operations, which do
not, and cannot, by themselves, efficiently implement Unicode character
sequences.
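To make that concrete, here is a two-line demonstration (Python 3, wide
or post-PEP-393 build assumed) of an array operation that is meaningless
at the character level:

    s = "e\u0301"   # "e" + COMBINING ACUTE ACCENT: one character to a reader
    len(s)          # 2 -- str counts code points, not characters
    s[:1]           # 'e' -- slicing silently strips the accent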
Python wants to, should, and can implement UTF-16 streams, UTF-8
streams, and UTF-32 streams. It should, and can, implement streams using
other encodings as well, and also binary streams.
Python wants to, should, and can implement 8-bit, 16-bit, 32-bit, and
64-bit arrays. These are efficient to access, index, and slice.
Python implements a veneer on some 8-bit, 16-bit, and 32-bit arrays,
called str (this will be more true post-PEP 393, although it is true
with caveats presently), which interprets array elements as code units
(currently) or codepoints (post-PEP) and implements operations that are
interesting for text processing, with caveats.
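On a pre-PEP-393 narrow build, the code-unit nature of that veneer is
directly observable whenever a string leaves the BMP:

    s = '\U00010000'   # a single code point outside the BMP
    len(s)             # 2 on a narrow build (a surrogate pair); 1 on wide
    s[0]               # '\ud800' on a narrow build -- a lone surrogate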
There is presently no support for arrays of Unicode grapheme clusters or
composed characters. The Python type called str may or may not be
properly documented, to the extent that there is confusion between the
actual contents of its elements and the concept of character as defined
by Unicode. From comments Guido has made, he is not interested in
changing the efficiency or access methods of the str type to raise its
level of Unicode support to the composed-character or grapheme-cluster
concepts. The str type itself can presently be used to process other
character encodings: if they use fixed-width elements narrower than 32
bits, those encodings might be treated as Unicode encodings, but there
is no requirement that they be, and some operations on str apply Unicode
semantics, so there are caveats.
So it seems that any semantics in support of composed characters,
grapheme clusters, or codepoints-stored-as-<32-bit-code-units must be
created as an add-on Python package (in Python), a C extension, or a
combination. It could be based on extensions to the existing str type,
on the array type, or on the bytes type. It could use an internal format
of 32-bit codepoints, PEP 393 variable-size codepoints, or 8- or 16-bit
code units.
In addition to the expected stream operations and the character-length,
indexing, and slicing operations, more complex operations would be
expected on Unicode string values: regular expressions, comparisons,
collation, case-shifting, and perhaps more. RTL and LTR awareness would
add complexity to all operations, or at least require variants of all
operations.
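As a taste of what such a package would wrap, comparison respecting
canonical equivalence can already be sketched on top of str and
unicodedata (a sketch, not a proposal for the actual API):

    import unicodedata

    def canonically_equal(a, b):
        # Normalize both sides to NFC so precomposed and decomposed
        # spellings of the same text compare equal.
        return (unicodedata.normalize('NFC', a)
                == unicodedata.normalize('NFC', b))

    canonically_equal('\u00e9', 'e\u0301')   # True: both are "é"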
The questions are:
1) Is anyone interested in writing a PEP for such a thing?
2) Is anyone interested in writing an implementation for such a thing?
3) How many conflicting opinions and arguments will be spawned, making
the poor person or persons above lose interest?
Brainstorming ideas (which may wander off-topic in some regards, but
were all inspired by this discussion):
BI-0: Tom's analysis makes me think that UTF-8, since it is on average
the most compact encoding across languages, combined with a foundation
type of bytes or 'B' arrays plus side indexes of some sort, could be an
appropriate starting point. UTF-8 is variable length, but so are
composed characters and grapheme clusters. Building an array each of
whose units could hold the largest grapheme cluster would seem extremely
inefficient, just as 32-bit Unicode is extremely inefficient for dealing
with ASCII, so variable-length units seem to be an imperative part of a
solution. At least until one thinks up BI-2.
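A rough sketch of BI-0, with all names hypothetical and the side index
kept naive for brevity (a real implementation would want a compact,
chunked index):

    class Utf8String:
        def __init__(self, text):
            self._buf = text.encode('utf-8')
            # Side index: starting byte offset of each code point.
            self._index = []
            pos = 0
            for ch in text:
                self._index.append(pos)
                pos += len(ch.encode('utf-8'))

        def __len__(self):
            return len(self._index)      # length in code points, O(1)

        def __getitem__(self, i):
            start = self._index[i]
            end = (self._index[i + 1] if i + 1 < len(self._index)
                   else len(self._buf))
            return self._buf[start:end].decode('utf-8')

The same index could instead record grapheme-cluster boundaries, which
is what makes a byte foundation attractive for variable-length units of
any kind.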
BI-1: Perhaps a 32-bit base, with the upper 11 bits used to cache
character characteristics from various character-attribute database
lookups, could be an effective alternative, but it wouldn't eliminate
the need to deal with variable-length units in length, indexing, and
slicing operations.
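The packing might look like this (the flag assignments are invented;
only the low 21 bits are meaningful to Unicode):

    COMBINING = 1 << 21     # hypothetical cached-attribute bits
    UPPERCASE = 1 << 22

    def pack(codepoint, flags=0):
        assert codepoint <= 0x10FFFF
        return codepoint | flags

    def unpack(unit):
        return unit & 0x1FFFFF          # mask off the 11 cached bits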
BI-2: Maybe a 32-bit base would be useful so that one high bit could be
used to flag that this character position actually holds an index to a
multi-codepoint character, and the index would then hold the actual
codes for that character. This would allow for at most 2^31 (and memory
limited) different multi-codepoint characters in a string (or perhaps
per application, if the multi-codepoint characters are shared between
strings), but would suddenly allow array indexing of grapheme clusters
and composed characters... with double-indexing required for
multi-codepoint character access. [This idea seems similar to one that
was mentioned elsewhere in this thread, suggesting that private use
characters could be used to represent multi-codepoint characters, but
(a) doesn't infringe on private uses, and (b) allows for a greater
number of multi-codepoint characters to be used.]
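A sketch of BI-2's double-indexing (layout invented for illustration):

    INDIRECT = 1 << 31  # high bit: element is an index, not a code point

    class ClusterString:
        def __init__(self):
            self.units = []     # 32-bit elements
            self.table = []     # side table of multi-codepoint characters

        def append(self, cluster):
            cps = [ord(c) for c in cluster]
            if len(cps) == 1:
                self.units.append(cps[0])
            else:
                self.units.append(INDIRECT | len(self.table))
                self.table.append(cps)

        def __getitem__(self, i):       # O(1) indexing by "character"
            u = self.units[i]
            if u & INDIRECT:
                return ''.join(chr(cp) for cp in self.table[u & ~INDIRECT])
            return chr(u)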
BI-3: Both BI-1 and BI-2 could also be built on top of the PEP 393
str... allowing multi-codepoint-character-supporting applications to
benefit from the space efficiencies of PEP 393 when no multi-codepoint
characters are fed into the application.
BI-4: Since Unicode has 21-bit codepoints, one wonders if 24-bit array
elements might be appropriate, rather than 32-bit. BI-2 could still
operate, with a theoretical reduction to 2^23 possible multi-codepoint
characters in an application. Access would be less efficient, but still
O(1), and 25% of the space would be saved. This idea could be applied
to PEP 393 independently of multi-codepoint character support.
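The 3-byte access pattern is simple enough (a sketch; names invented):

    class Utf24String:
        def __init__(self, text=''):
            self._buf = bytearray()
            for ch in text:
                self._buf += ord(ch).to_bytes(3, 'little')

        def __len__(self):
            return len(self._buf) // 3

        def __getitem__(self, i):            # O(1), but a multiply plus
            chunk = self._buf[3*i : 3*i + 3]  # a 3-byte decode per access
            return chr(int.from_bytes(chunk, 'little'))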
BI-5: I'm pretty sure there are inappropriate or illegal sequences of
combining characters, as well as characters that should not stand alone;
lone surrogates are one example. Such characters, when out of an
appropriate sequence, could be flagged with a high bit so that they
could be quickly recognized as illegal Unicode, but codecs could be
provided to allow them to round-trip, and applications could recognize
immediately that they should be handled as "binary gibberish" in an
otherwise Unicode stream. This idea could be applied to PEP 393
independently of additional multi-codepoint character support.
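For the lone-surrogate case the flagging is a one-liner (the flag bit is
invented here; with 24- or 32-bit elements there is room for it):

    GIBBERISH = 1 << 31   # hypothetical "not valid Unicode here" flag

    def store(cp):
        if 0xD800 <= cp <= 0xDFFF:   # a surrogate is never a character
            return cp | GIBBERISH
        return cp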
BI-6: Maybe another high bit could be used, with a different codec error
handler, instead of using lone surrogates when decoding
not-quite-conformant byte streams (such as OS filenames). It's a pity we
didn't think of this one before doing all the lone-surrogate machinery.
Of course, this solution wouldn't work on narrow builds, because a
16-bit code unit has no room for values above the Unicode codepoint
range! But once we have PEP 393, we _could_ replace inappropriate uses
of lone surrogates with integers outside the Unicode codepoint range,
without introducing ambiguity in the interpretation of lone surrogates.
This idea could be applied to PEP 393 independently of multi-codepoint
character support.
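The escape range could be as simple as this (the offset is invented;
these are exactly the values PEP 393's internal storage could hold but a
narrow build cannot):

    ESCAPE_BASE = 0x110000   # first integer past the Unicode range

    def escape_byte(b):
        # An undecodable raw byte maps above the Unicode range,
        # so it can never collide with a real code point.
        return ESCAPE_BASE + b          # e.g. 0xFF -> 0x1100FF

    def is_escaped_byte(value):
        return value >= ESCAPE_BASE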
Glenn