[Python-Dev] PEP 393 Summer of Code Project
Glenn Linderman
v+python at g.nevcal.com
Wed Aug 31 10:09:25 CEST 2011
On 8/30/2011 11:03 PM, Stephen J. Turnbull wrote:
> Guido van Rossum writes:
> > On Tue, Aug 30, 2011 at 7:55 PM, Stephen J. Turnbull <stephen at xemacs.org> wrote:
>
> > > For starters, one that doesn't ever return lone surrogates, but rather
> > > interprets surrogate pairs as Unicode code points as in UTF-16. (This
> > > is not a Unicode standard definition, it's intended to be suggestive
> > > of why many app writers will be distressed if they must use Python
> > > unicode/str in a narrow build without a fairly comprehensive library
> > > that wraps the arrays in operations that treat unicode/str as an array
> > > of code points.)
> >
> > That sounds like a contradiction -- it wouldn't be a UTF-16 array if
> > you couldn't tell that it was using UTF-16.
>
> Well, that's why I wrote "intended to be suggestive". The Unicode
> Standard does not specify at all what the internal representation of
> characters may be, it only specifies what their external behavior must
> be when two processes communicate. (For "process" as used in the
> standard, think "Python modules" here, since we are concerned with the
> problems of folks who develop in Python.) When observing the behavior
> of a Unicode process, there are no UTF-16 arrays or UTF-8 arrays or
> even UTF-32 arrays; only arrays of characters.
>
> Thus, according to the rules of handling a UTF-16 stream, it is an
> error to observe a lone surrogate or a surrogate pair that isn't a
> high-low pair (Unicode 6.0, Ch. 3 "Conformance", requirements C1 and
> C8-C10). That's what I mean by "can't tell it's UTF-16". And I
> understand those requirements to mean that operations on UTF-16
> streams should produce UTF-16 streams, or raise an error. Without
> that closure property for basic operations on str, I think it's a bad
> idea to say that the representation of text in a str in a pre-PEP-393
> "narrow" build is UTF-16. For many users and app developers, it
> creates expectations that are not fulfilled.
>
> It's true that common usage is that an array of code units that
> usually conforms to UTF-16 may be called "UTF-16" without the closure
> properties. I just disagree with that usage, because there are two
> camps that interpret "UTF-16" differently. One side says, "we have an
> array representation in UTF-16 that can handle all Unicode code points
> efficiently, and if you think you need more, think again", while the
> other says "it's too painful to have to check every result for valid
> UTF-16, and we need a UTF-16 type that supports the usual array
> operations on *characters* via the usual operators; if you think
> otherwise, think again."
>
> Note that despite the (presumed) resolution of the UTF-16 issue for
> CPython by PEP 393, at some point a very similar discussion will take
> place over "characters" anyway, because users and app developers are
> going to want a type that handles composition sequences and/or
> grapheme clusters for them, as well as comparison that respects
> canonical equivalence, even if it is inefficient compared to str.
> That's why I insisted on use of "array of code points" to describe the
> PEP 393 str type, rather than "array of characters".
On topic:
So from reading all this discussion, I think this point is rather a key
one... and it has been made repeatedly in different ways: Arrays are
not suitable for manipulating Unicode character sequences, and the str
type is an array with a veneer of text manipulation operations, which do
not, and cannot, by themselves, efficiently implement Unicode character
sequences.
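To make that concrete, here is a two-line demonstration (Python 3, wide
or post-PEP-393 build assumed) of an array operation that is meaningless
at the character level:

    s = "e\u0301"   # "e" + COMBINING ACUTE ACCENT: one character to a reader
    len(s)          # 2 -- str counts code points, not characters
    s[:1]           # 'e' -- slicing silently strips the accent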
Python wants to, should, and can implement UTF-16 streams, UTF-8
streams, and UTF-32 streams. It should, and can, implement streams using
other encodings as well, and also binary streams.
Python wants to, should, and can implement 8-bit, 16-bit, 32-bit, and
64-bit arrays. These are efficient to access, index, and slice.
Python implements a veneer on some 8-bit, 16-bit, and 32-bit arrays,
called str (this will be more true post-PEP 393, although it is true
with caveats presently), which interprets array elements as code units
(currently) or codepoints (post-PEP) and implements operations that are
interesting for text processing, with caveats.
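On a pre-PEP-393 narrow build, the code-unit nature of that veneer is
directly observable whenever a string leaves the BMP:

    s = '\U00010000'   # a single code point outside the BMP
    len(s)             # 2 on a narrow build (a surrogate pair); 1 on wide
    s[0]               # '\ud800' on a narrow build -- a lone surrogate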
There is presently no support for arrays of Unicode grapheme clusters or
composed characters. The Python type called str may or may not be
properly documented, to the extent that there is confusion between the
actual contents of its elements and the concept of character as defined
by Unicode. From comments Guido has made, he is not interested in
changing the efficiency or access methods of the str type to raise its
level of Unicode support to the composed-character or grapheme-cluster
concepts. The str type itself can presently be used to process other
character encodings: if they use fixed-width elements narrower than 32
bits, those encodings might be treated as Unicode encodings, but there
is no requirement that they be, and some operations on str apply Unicode
semantics, so there are caveats.
So it seems that any semantics in support of composed characters,
grapheme clusters, or codepoints-stored-as-<32-bit-code-units must be
created as an add-on Python package (in Python), a C extension, or a
combination. It could be based on extensions to the existing str type,
on the array type, or on the bytes type. It could use an internal format
of 32-bit codepoints, PEP 393 variable-size codepoints, or 8- or 16-bit
code units.
In addition to the expected stream operations and the character-length,
indexing, and slicing operations, more complex operations would be
expected on Unicode string values: regular expressions, comparisons,
collation, case-shifting, and perhaps more. RTL and LTR awareness would
add complexity to all operations, or at least require variants of all
operations.
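As a taste of what such a package would wrap, comparison respecting
canonical equivalence can already be sketched on top of str and
unicodedata (a sketch, not a proposal for the actual API):

    import unicodedata

    def canonically_equal(a, b):
        # Normalize both sides to NFC so precomposed and decomposed
        # spellings of the same text compare equal.
        return (unicodedata.normalize('NFC', a)
                == unicodedata.normalize('NFC', b))

    canonically_equal('\u00e9', 'e\u0301')   # True: both are "é"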
The questions are:
1) Is anyone interested in writing a PEP for such a thing?
2) Is anyone interested in writing an implementation for such a thing?
3) How many conflicting opinions and arguments will be spawned, making
the poor person or persons above lose interest?
Brainstorming ideas (which may wander off-topic in some regards, but
were all inspired by this discussion):
BI-0: Tom's analysis makes me think that UTF-8, since it is on average
the most compact encoding across languages, combined with a foundation
type of bytes or 'B' arrays plus side indexes of some sort, could be an
appropriate starting point. UTF-8 is variable length, but so are
composed characters and grapheme clusters. Building an array each of
whose units could hold the largest grapheme cluster would seem extremely
inefficient, just as 32-bit Unicode is extremely inefficient for dealing
with ASCII, so variable-length units seem to be an imperative part of a
solution. At least until one thinks up BI-2.
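A rough sketch of BI-0, with all names hypothetical and the side index
kept naive for brevity (a real implementation would want a compact,
chunked index):

    class Utf8String:
        def __init__(self, text):
            self._buf = text.encode('utf-8')
            # Side index: starting byte offset of each code point.
            self._index = []
            pos = 0
            for ch in text:
                self._index.append(pos)
                pos += len(ch.encode('utf-8'))

        def __len__(self):
            return len(self._index)      # length in code points, O(1)

        def __getitem__(self, i):
            start = self._index[i]
            end = (self._index[i + 1] if i + 1 < len(self._index)
                   else len(self._buf))
            return self._buf[start:end].decode('utf-8')

The same index could instead record grapheme-cluster boundaries, which
is what makes a byte foundation attractive for variable-length units of
any kind.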
BI-1: Perhaps a 32-bit base, with the upper 11 bits used to cache
character characteristics from various character-attribute database
lookups, could be an effective alternative, but it wouldn't eliminate
the need to deal with variable-length units in length, indexing, and
slicing operations.
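The packing might look like this (the flag assignments are invented;
only the low 21 bits are meaningful to Unicode):

    COMBINING = 1 << 21     # hypothetical cached-attribute bits
    UPPERCASE = 1 << 22

    def pack(codepoint, flags=0):
        assert codepoint <= 0x10FFFF
        return codepoint | flags

    def unpack(unit):
        return unit & 0x1FFFFF          # mask off the 11 cached bits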
BI-2: Maybe a 32-bit base would be useful so that one high bit could be
used to flag that this character position actually holds an index to a
multi-codepoint character, and the index would then hold the actual
codes for that character. This would allow for at most 2^31 (and memory
limited) different multi-codepoint characters in a string (or perhaps
per application, if the multi-codepoint characters are shared between
strings), but would suddenly allow array indexing of grapheme clusters
and composed characters... with double-indexing required for
multi-codepoint character access. [This idea seems similar to one that
was mentioned elsewhere in this thread, suggesting that private use
characters could be used to represent multi-codepoint characters, but
(a) doesn't infringe on private uses, and (b) allows for a greater
number of multi-codepoint characters to be used.]
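A sketch of BI-2's double-indexing (layout invented for illustration):

    INDIRECT = 1 << 31  # high bit: element is an index, not a code point

    class ClusterString:
        def __init__(self):
            self.units = []     # 32-bit elements
            self.table = []     # side table of multi-codepoint characters

        def append(self, cluster):
            cps = [ord(c) for c in cluster]
            if len(cps) == 1:
                self.units.append(cps[0])
            else:
                self.units.append(INDIRECT | len(self.table))
                self.table.append(cps)

        def __getitem__(self, i):       # O(1) indexing by "character"
            u = self.units[i]
            if u & INDIRECT:
                return ''.join(chr(cp) for cp in self.table[u & ~INDIRECT])
            return chr(u)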
BI-3: Both BI-1 and BI-2 could also be built on top of the PEP 393
str... allowing multi-codepoint-character-supporting applications to
benefit from the space efficiencies of PEP 393 when no multi-codepoint
characters are fed into the application.
BI-4: Since Unicode has 21-bit codepoints, one wonders if 24-bit array
elements might be appropriate, rather than 32-bit. BI-2 could still
operate, with a theoretical reduction to 2^23 possible multi-codepoint
characters in an application. Access would be less efficient, but still
O(1), and 25% of the space would be saved. This idea could be applied
to PEP 393 independently of multi-codepoint character support.
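The 3-byte access pattern is simple enough (a sketch; names invented):

    class Utf24String:
        def __init__(self, text=''):
            self._buf = bytearray()
            for ch in text:
                self._buf += ord(ch).to_bytes(3, 'little')

        def __len__(self):
            return len(self._buf) // 3

        def __getitem__(self, i):            # O(1), but a multiply plus
            chunk = self._buf[3*i : 3*i + 3]  # a 3-byte decode per access
            return chr(int.from_bytes(chunk, 'little'))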
BI-5: I'm pretty sure there are inappropriate or illegal sequences of
combining characters, as well as characters that should not stand alone;
lone surrogates are one example. Such characters, when out of an
appropriate sequence, could be flagged with a high bit so that they
could be quickly recognized as illegal Unicode, but codecs could be
provided to allow them to round-trip, and applications could recognize
immediately that they should be handled as "binary gibberish" in an
otherwise Unicode stream. This idea could be applied to PEP 393
independently of additional multi-codepoint character support.
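For the lone-surrogate case the flagging is a one-liner (the flag bit is
invented here; with 24- or 32-bit elements there is room for it):

    GIBBERISH = 1 << 31   # hypothetical "not valid Unicode here" flag

    def store(cp):
        if 0xD800 <= cp <= 0xDFFF:   # a surrogate is never a character
            return cp | GIBBERISH
        return cp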
BI-6: Maybe another high bit could be used, with a different codec error
handler, instead of using lone surrogates when decoding
not-quite-conformant byte streams (such as OS filenames). It's a pity we
didn't think of this one before doing all the lone-surrogate machinery.
Of course, this solution wouldn't work on narrow builds, because a
16-bit code unit has no room for values above the Unicode codepoint
range! But once we have PEP 393, we _could_ replace inappropriate uses
of lone surrogates with integers outside the Unicode codepoint range,
without introducing ambiguity in the interpretation of lone surrogates.
This idea could be applied to PEP 393 independently of multi-codepoint
character support.
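The escape range could be as simple as this (the offset is invented;
these are exactly the values PEP 393's internal storage could hold but a
narrow build cannot):

    ESCAPE_BASE = 0x110000   # first integer past the Unicode range

    def escape_byte(b):
        # An undecodable raw byte maps above the Unicode range,
        # so it can never collide with a real code point.
        return ESCAPE_BASE + b          # e.g. 0xFF -> 0x1100FF

    def is_escaped_byte(value):
        return value >= ESCAPE_BASE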
Glenn