[Python-Dev] PEP 393 Summer of Code Project

Glenn Linderman v+python at g.nevcal.com
Wed Aug 24 20:52:51 CEST 2011


On 8/24/2011 9:00 AM, Stefan Behnel wrote:
> Nick Coghlan, 24.08.2011 15:06:
>> On Wed, Aug 24, 2011 at 10:46 AM, Terry Reedy wrote:
>>> In utf16.py, attached to http://bugs.python.org/issue12729
>>> I propose for consideration a prototype of a different solution to the
>>> 'mostly BMP chars, few non-BMP chars' case. Rather than expand every
>>> character from 2 bytes to 4, attach an array cpdex of character (ie
>>> code point, not code unit) indexes. Then for indexing and slicing, the
>>> correction is simple, simpler than I first expected:
>>>   code-unit-index = char-index + bisect.bisect_left(cpdex, char_index)
>>> where code-unit-index is the adjusted index into the full underlying
>>> double-byte array. This adds a time penalty of log2(len(cpdex)), but
>>> avoids most of the space penalty and the consequent time penalty of
>>> moving more bytes around and increasing cache misses.
>>
>> Interesting idea, but putting on my C programmer hat, I say -1.
>>
>> Non-uniform cell size = not a C array = standard C array manipulation
>> idioms don't work = pain (no matter how simple the index correction
>> happens to be).
>>
>> The nice thing about PEP 393 is that it gives us the smallest storage
>> array that is both an ordinary C array and has sufficiently large
>> individual elements to handle every character in the string.
>
> +1 

Yes, this sounds like a nice benefit, but the problem is that it is 
false.  The correct statement would be:

The nice thing about PEP 393 is that it gives us the smallest storage
array that is both an ordinary C array and has sufficiently large
individual elements to handle every Unicode codepoint in the string.

As Tom eloquently describes in the referenced issue (is Tom ever 
non-eloquent?), not all characters can be represented in a single codepoint.

It seems there are three concepts in Unicode: code units, code points, 
and characters, none of which are equivalent (and the first of which 
varies according to the encoding).  It also seems (to me) that Unicode 
has failed in its original premise of being an easy way to handle "big 
char" for "all languages" with fixed-size elements.  It is not clear 
that the original premise is achievable regardless of the size of "big 
char" when mixed directionality is desired, and it seems that support 
for some single languages requires mixed directionality, not to mention 
mixed-language support.
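
To make the distinction concrete, here is a small Python 3 sketch (the 
example characters are my own choices, not taken from the issue) showing 
that one user-perceived character can span several code points, and one 
code point can span several code units:

    # One user-perceived character built from two code points:
    s = "e\u0301"          # 'e' + COMBINING ACUTE ACCENT, displays as 'é'
    print(len(s))          # 2 -- Python counts code points, not characters

    # One code point outside the BMP spans multiple code units:
    t = "\U0001D11E"       # MUSICAL SYMBOL G CLEF
    print(len(t.encode("utf-16-le")) // 2)   # 2 UTF-16 code units (a surrogate pair)
    print(len(t.encode("utf-8")))            # 4 UTF-8 code units (bytes)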

Given the required variability of character size in all presently 
defined Unicode encodings, I tend to agree with Tom that UTF-8, together 
with some technique for translating character index to code unit offset, 
may provide the best overall space utilization and adequate CPU 
efficiency.  On the other hand, there are large subsets of applications 
that simply do not require support for bidirectional text or composed 
characters, and for those, it remains to be seen whether the price to be 
paid for supporting those features is too high.  So far, we don't have 
implementations to benchmark to figure that out!
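
As a rough illustration of the kind of index translation involved, here 
is a minimal sketch of Terry's correction formula applied to a UTF-16 
code-unit array (the helper names are mine, not taken from the utf16.py 
attached to the issue):

    import bisect

    def build_cpdex(code_units):
        """Code point indexes of the characters stored as surrogate pairs."""
        cpdex = []
        unit = char = 0
        while unit < len(code_units):
            if 0xD800 <= code_units[unit] < 0xDC00:   # high surrogate
                cpdex.append(char)
                unit += 2          # skip the paired low surrogate
            else:
                unit += 1
            char += 1
        return cpdex

    def code_unit_index(char_index, cpdex):
        # Terry's correction: every non-BMP character before char_index
        # contributes one extra code unit.
        return char_index + bisect.bisect_left(cpdex, char_index)

    # "a", MUSICAL SYMBOL G CLEF (a surrogate pair), "b"
    units = [0x0061, 0xD834, 0xDD1E, 0x0062]
    cpdex = build_cpdex(units)          # [1]
    print(code_unit_index(2, cpdex))    # 3 -- 'b' starts at code unit 3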

What does this mean for Python?  Well, if Python is willing to limit its 
support to the subset of applications for which the "big char" solution 
is sufficient, then PEP 393 provides a way to do that, and it looks to 
be pretty effective at reducing memory consumption for applications that 
use short strings, most of which can be classified by content into the 
1-byte or 2-byte representations.  Applications that handle long strings 
are more likely to be bitten by the occasional "outlier" character that 
is larger than the average character, doubling or quadrupling the space 
needed to represent such strings, and eliminating a significant portion 
of the space savings the PEP provides for other applications.  
Benchmarks may or may not fully reflect the actual requirements of all 
applications, so conclusions based on benchmarking can easily be 
blind-sided by the realities of other applications, unless the 
benchmarks are carefully constructed.
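
Once an implementation of the PEP exists, the outlier effect should be 
easy to demonstrate; something along these lines ought to show it (the 
sizes are illustrative, assuming the PEP's 1/2/4-byte representations 
plus a small per-object header):

    import sys

    ascii_only = "x" * 10000
    with_outlier = "x" * 9999 + "\U0001D11E"   # one non-BMP character at the end

    # The first string can be stored one byte per character, but the
    # single outlier forces the second into four bytes per character.
    print(sys.getsizeof(ascii_only))     # roughly 10000 bytes plus the header
    print(sys.getsizeof(with_outlier))   # roughly 40000 bytes plus the header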

It is possible that the ideas in PEP 393, with its support for multiple 
underlying representations, could be the basis for some more complex 
representations that would better support characters rather than only 
code points, but Martin has stated he is not open to additional 
representations, so the PEP itself cannot be that basis (although, 
depending on how carefully the PEP is implemented, the implementation 
may still provide that basis).