[Python-Dev] PEP 393 Summer of Code Project
Glenn Linderman
v+python at g.nevcal.com
Wed Aug 31 21:15:12 CEST 2011
On 8/31/2011 5:21 AM, Stephen J. Turnbull wrote:
> Glenn Linderman writes:
>
> > From comments Guido has made, he is not interested in changing the
> > efficiency or access methods of the str type to raise the level of
> > support of Unicode to the composed character, or grapheme cluster
> > concepts.
>
> IMO, that would be a bad idea,
OK, you agree with Guido.
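To make the distinction concrete, here is a minimal sketch (mine, not anything from the PEP) of why code point access falls short of the composed character / grapheme cluster level:

```python
import unicodedata

# Sketch of the gap under discussion: str works at the code point level,
# so one user-perceived character (grapheme cluster) can occupy several
# indexable positions.
s = "e\u0301"        # 'e' + COMBINING ACUTE ACCENT: renders as one character
assert len(s) == 2   # ...but it is two code points
assert s[0] == "e"   # slicing can split the cluster

# NFC normalization happens to fold this pair into a single code point,
# but not every grapheme cluster has a precomposed form.
composed = unicodedata.normalize("NFC", s)
assert len(composed) == 1
```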
> as higher-level Unicode support should
> either be a wrapper around full implementations such as ICU (or
> platform support in .NET or Java), or written in pure Python at first.
> Thus there is a need for an efficient array of code units type. PEP
> 393 allows this to go to the level of code points, but evidently that
> is inappropriate for Jython and IronPython.
>
> > The str type itself can presently be used to process other
> > character encodings:
>
> Not really. Remember, on input codecs always decode to Unicode and on
> output they always encode from Unicode. How do you propose to get
> other encodings into the array of code units?
Here are two ways (there may be more): custom codecs, and direct assignment.
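A sketch of the "direct assignment" route (my phrasing and my example, not an established API): build a str whose code points are really code units of some other encoding, one chr() at a time.

```python
# Sketch: a str used as a plain array of code units from another encoding.
raw = bytes([0x48, 0x69, 0xFF])            # arbitrary Latin-1 code units
units = "".join(chr(b) for b in raw)       # each unit stored as one code point
assert len(units) == 3

# A custom codec could perform the same mapping on decode; here the
# built-in 'latin-1' codec happens to round-trip the bytes exactly.
assert units.encode("latin-1") == raw
```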
> > [A "true Unicode" type] could be based on extensions to the
> > existing str type, or it could be based on the array type, or it
> > could based on the bytes type. It could use an internal format of
> > 32-bit codepoints, PEP 393 variable-size codepoints, or 8- or
> > 16-bit codeunits.
>
> In theory yes, but in practice all of the string methods and libraries
> like re operate on str (and often but not always bytes; in particular,
> codecs always decode from byte and encode to bytes).
>
> Why bother with anything except arrays of code points at the start?
> PEP 393 makes that time-efficient and reasonably space-efficient as a
> starting point and allows starting with re or MRAB's regex to get
> basic RE functionality or good UTS #18 functionality respectively.
> Plus str already has all the usual string operations (.startswith(),
> .join(), etc), and we have modules for dealing with the Unicode
> Character Database. Why waste effort reintegrating with all that,
> until we have common use cases that need more efficient representation?
String methods could be reimplemented on any appropriate type, of
course. Rejecting alternatives too soon might make one miss the best
design.
> There would be some issue in coming up with an appropriate UTF-16 to
> code point API for Jython and IronPython, but Terry Reedy has a rather
> efficient library for that already.
Yes, Terry's implementation is interesting and inspiring, and the
concept could be extended to a variety of techniques: code point access
on top of code unit representations, and multi-code-point character
access on top of either code unit or code point representations.
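A hedged sketch of the bisect idea (my reconstruction of the concept, not Terry's actual library): record which code point positions require surrogate pairs, then translate code point indices to UTF-16 code unit indices in O(log n).

```python
import bisect

def build_index(code_units):
    """Return code point indices of characters stored as surrogate pairs."""
    supp, unit, cp = [], 0, 0
    while unit < len(code_units):
        if 0xD800 <= code_units[unit] <= 0xDBFF:   # high surrogate leads a pair
            supp.append(cp)
            unit += 2
        else:
            unit += 1
        cp += 1
    return supp

def unit_index(supp, cp_index):
    """Code unit offset of the cp_index-th code point, O(log n)."""
    # Every earlier surrogate pair shifts later code points by one unit.
    return cp_index + bisect.bisect_left(supp, cp_index)

# "a" U+10000 "b" in UTF-16 code units:
units = [0x0061, 0xD800, 0xDC00, 0x0062]
supp = build_index(units)
assert supp == [1]
assert [unit_index(supp, i) for i in range(3)] == [0, 1, 3]
```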
> So this discussion of alternative representations, including use of
> high bits to represent properties, is premature optimization
> ... especially since we don't even have a proto-PEP specifying how
> much conformance we want of this new "true Unicode" type in the first
> place.
>
> We need to focus on that before optimizing anything.
You may call it premature optimization if you like, or you can ignore
the concepts and emails altogether. I call it brainstorming for ideas,
looking for non-obvious solutions to the problem of representation of
Unicode.
I found your discussion of streams versus arrays, as separate concepts
related to Unicode, along with Terry's bisect indexing implementation,
rather inspiring. Just because Unicode defines streams of code units of
various sizes (UTF-8, UTF-16, UTF-32) to represent characters when
processes communicate and for storage (which is one way processes
communicate), that doesn't imply that the internal representation of
character strings in a programming language must use exactly that
representation. While there are efficiencies in using the same
representation as the communications streams, there are also
inefficiencies.

I'm unaware of any current Python implementation that has chosen UTF-8
as the internal representation of character strings (though I'm aware
that Perl has made that choice), yet UTF-8 is one of the commonly
recommended character representations on the Linux platform, from what
I read. So in that sense, Python has already rejected the idea of using
the "native" or "OS-configured" representation as its internal
representation. Why, then, must one choose from the repertoire of
Unicode-defined stream representations if they don't meet the goal of
efficient length, indexing, or slicing operations on actual characters?
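The indexing cost I mean can be sketched as follows (my illustration; utf8_char_at is a hypothetical helper, not a stdlib function): in a UTF-8 stream the n-th character has no fixed byte offset, so naive indexing must scan.

```python
def utf8_char_at(data: bytes, index: int) -> str:
    """Return the index-th code point by scanning lead bytes: O(n)."""
    count = -1
    for pos, byte in enumerate(data):
        if byte & 0xC0 != 0x80:          # a lead byte starts a character
            count += 1
            if count == index:
                end = pos + 1
                while end < len(data) and data[end] & 0xC0 == 0x80:
                    end += 1             # swallow continuation bytes
                return data[pos:end].decode("utf-8")
    raise IndexError(index)

encoded = "héllo".encode("utf-8")        # b'h\xc3\xa9llo'
assert utf8_char_at(encoded, 1) == "é"   # linear scan, versus O(1) on a
                                         # fixed-width code point array
```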