[Python-Dev] PEP 393 Summer of Code Project

Wed Aug 31 21:14:25 CEST 2011

On 8/31/2011 11:56 AM, Guido van Rossum wrote:
> On Wed, Aug 31, 2011 at 11:51 AM, Glenn Linderman
> <v+python at g.nevcal.com> wrote:
>
>     On 8/31/2011 10:12 AM, Guido van Rossum wrote:
>>     On Wed, Aug 31, 2011 at 1:09 AM, Glenn Linderman <v+python at g.nevcal.com> wrote:
>>>     So from reading all this discussion, I think this point is rather a key
>>>     one... and it has been made repeatedly in different ways:  Arrays are not
>>>     suitable for manipulating Unicode character sequences, and the str type is
>>>     an array with a veneer of text manipulation operations, which do not, and
>>>     cannot, by themselves, efficiently implement Unicode character sequences.
>>     I think this is too strong. The str type is indeed an array, and you
>>     can build useful Unicode manipulation APIs on top of it. Just like
>>     bytes are not UTF-8, but can be used to represent UTF-8 and a
>>     fully-compliant UTF-8 codec can be implemented on top of it.
>>
>
>     This statement is a logical conclusion of arguments presented in
>     this thread.
>
>     1) Applications that wish to do grapheme access, wish to do it by
>     grapheme array indexing, because that is the efficient way to do it.
>
>
> I don't believe that should be taken as gospel. In Perl, they don't do 
> array indexing on strings at all, and use regex matching instead. An 
> API that uses some kind of cursor on a string might work fine in 
> Python too (for grapheme matching).

The last benchmark I saw showed regexp in Perl to be faster than regexp
in Python, but that was some years back, before regexp in Perl supported
quite as much Unicode as it does now. I'm not sure whether anyone has
done a recent performance benchmark. Tom's survey indicates that the
functionality presently differs, so it is not clear that benchmarks
comparing Unicode operations in regexp between the two languages would
be meaningful at present.

That said, regexp, or some sort of cursor on a string, might be a 
workable solution.  Will it have adequate performance?  Perhaps, at 
least for some applications.  Will it be as conceptually simple as 
indexing an array of graphemes?  No.  Will it ever reach the efficiency 
of indexing an array of graphemes? No.  Does that matter? Depends on the 
application.
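
To make the comparison concrete, here is a minimal sketch of the two
access styles. It assumes a deliberately simplified notion of grapheme
(a base code point plus any following combining marks, as reported by
unicodedata.combining), which is not the full Unicode segmentation
algorithm. The function names are hypothetical, not an existing API.

```python
import unicodedata


def grapheme_spans(s):
    """Yield (start, end) code point offsets of approximate graphemes.

    Assumption: a grapheme is a base code point followed by zero or
    more combining marks. Real grapheme cluster rules are more complex.
    """
    start = 0
    for i in range(1, len(s)):
        if unicodedata.combining(s[i]) == 0:
            # A non-combining code point starts a new grapheme.
            yield (start, i)
            start = i
    if s:
        yield (start, len(s))


def nth_grapheme_by_cursor(s, n):
    """Cursor style: scan from the front of the str, O(n) per lookup."""
    for k, (a, b) in enumerate(grapheme_spans(s)):
        if k == n:
            return s[a:b]
    raise IndexError(n)


def graphemes_as_list(s):
    """Array style: pay O(len(s)) once, then each index is O(1)."""
    return [s[a:b] for a, b in grapheme_spans(s)]


text = "e\u0301a" + "o\u0308"   # e + acute accent, a, o + diaeresis
cursor_result = nth_grapheme_by_cursor(text, 2)
array_result = graphemes_as_list(text)[2]
```

Both calls return the same grapheme; the difference is that the cursor
version rescans the str on every lookup, while the list version pays
the scan once and then stores a second copy of the text.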

>
>     2) As long as str is restricted to holding Unicode code units or
>     code points, then it cannot support grapheme array indexing
>     efficiently.
>
>     I  have not declared that useful Unicode manipulations APIs cannot
>     be built on top of str, only that efficiency will suffer.
>
>
> But you have not proven it.

Do you disagree that indexing an array is more efficient than 
manipulating strings with regex or binary trees?  I think not, because 
you are insistent that array indexing of str be preserved as O(1).  I 
agree that I have not proven it; it largely depends on whether or not 
indexing by grapheme cluster is a useful operation in applications.  Yet 
Stephen (I think) has commented that emacs performance goes down as soon 
as multi-byte characters are introduced into an edit buffer.  So I think 
he has proven that efficiency can suffer, in some 
implementations/applications.  Terry's O(k) implementation requires data 
beyond strings, and isn't O(1).
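
As an illustration of what "data beyond strings" can mean, here is a
minimal sketch of a side table of grapheme start offsets kept next to a
str. This is not Terry's implementation, only a hypothetical example
under the same simplifying assumption that a grapheme is a base code
point plus its following combining marks. The names build_offsets and
grapheme_at are made up for this sketch.

```python
import unicodedata


def build_offsets(s):
    """Return the start offset of each approximate grapheme in s,
    followed by len(s) as a sentinel. Building the table is O(len(s)).
    """
    offsets = [
        i for i, ch in enumerate(s)
        if i == 0 or unicodedata.combining(ch) == 0
    ]
    offsets.append(len(s))
    return offsets


def grapheme_at(s, offsets, n):
    """O(1) grapheme lookup, given the precomputed offsets table."""
    if not 0 <= n < len(offsets) - 1:
        raise IndexError(n)
    return s[offsets[n]:offsets[n + 1]]


text = "e\u0301a" + "o\u0308"   # e + acute accent, a, o + diaeresis
table = build_offsets(text)     # extra data kept alongside the str
middle = grapheme_at(text, table, 1)
```

The lookup itself is O(1), but only because the table exists; the str
alone still indexes by code point, so text[1] is the combining accent
rather than the second grapheme.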