[Python-Dev] PEP 393 Summer of Code Project

Terry Reedy tjreedy at udel.edu
Fri Sep 9 19:16:17 CEST 2011


On 9/9/2011 12:12 PM, fwierzbicki at gmail.com wrote:
> On Thu, Sep 8, 2011 at 10:39 PM, Terry Reedy <tjreedy at udel.edu> wrote:
>> On 9/8/2011 6:15 PM, fwierzbicki at gmail.com wrote:
>>>
>>> Oops, forgot to add the link for the gory details for Java and > 2 byte
>>> unicode:
>>>
>>> http://java.sun.com/developer/technicalArticles/Intl/Supplementary/
>>
>> This is dated 2004. Basically, they considered several options, tried out 4,
>> and ended up sticking with char[] (sequences) as UTF-16 with char = 16 bit
>> code unit and added 32-bit Character(int) class for low-level manipulation
>> of code points.
>>
>> I did not see the indexing problem mentioned. I get the impression that they
>> encourage sequence forward-backward iteration (cursor-based access) rather
>> than random-access indexing.
> Hmmm, sorry for the irrelevant link - my lack of expertise here is
> showing. What I do know is that we (meaning Jim Baker) are taking
> great pains to always use codepoints even for random access in our
> unicode code. I can't speak to the performance implications without
> some deeper study into what Jim has done.

I am curious how you index by code point rather than code unit with 
16-bit code units and how it compares with the method I posted. Is there 
anything I can read? Reply off list if you want.
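
For concreteness, the baseline any such scheme has to improve on is a
linear scan over the code units that counts a surrogate pair as one code
point. The following is only a minimal sketch of that naive approach. The
function names utf16_units and code_point_at are made up for illustration,
and this is neither Jython's implementation nor the method posted earlier
in this thread.

```python
def utf16_units(text):
    """Return text as a list of 16-bit code units (UTF-16, no BOM)."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]


def code_point_at(units, index):
    """Return the code point at code point position `index`.

    Naive O(n) scan: walk the code units from the start and treat a
    high surrogate followed by a low surrogate as one code point.
    Negative indexes are not supported in this sketch.
    """
    if index < 0:
        raise IndexError("negative index not supported in this sketch")
    i = 0  # position measured in code units
    n = len(units)
    while i < n:
        unit = units[i]
        is_pair = (0xD800 <= unit <= 0xDBFF and i + 1 < n
                   and 0xDC00 <= units[i + 1] <= 0xDFFF)
        if is_pair:
            # Combine the surrogate pair into one supplementary code point.
            cp = 0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            width = 2
        else:
            cp = unit
            width = 1
        if index == 0:
            return cp
        index -= 1
        i += width
    raise IndexError("code point index out of range")


# "a", U+10000 (stored as a surrogate pair), "b": 4 code units, 3 code points.
units = utf16_units("a\U00010000b")
```

Here code_point_at(units, 2) has to step over the surrogate pair to reach
"b", which is why plain indexing into the code units gives a different
answer once a non-BMP character appears earlier in the string.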

-- 
Terry Jan Reedy


