Re: [Python-Dev] PEP 393 Summer of Code Project

Sept. 9, 2011

      On 9/9/2011 5:21 PM, Guido van Rossum wrote:
...
I, for one, am very interested. It sounds like the 'unicode' datatype
in Jython does not in fact have O(1) indexing characteristics if the
string contains any characters in the astral plane. Interesting. I
wonder if you have heard from anyone about this affecting their app's
performance?
--Guido
The question is whether or how often any Jython users are yet 
indexing/slicing long strings with astral chars. If a utf-8 xml file is 
directly parsed into a DOM, then the longest decoded strings will be 
'paragraphs' that are seldom more than 1000 chars.
...
On Fri, Sep 9, 2011 at 12:58 PM, fwierzbicki@gmail.com
<fwierzbicki@gmail.com>  wrote:
...
On Fri, Sep 9, 2011 at 10:16 AM, Terry Reedy<tjreedy@udel.edu>  wrote:
...
I am curious how you index by code point rather than code unit with 16-bit
code units and how it compares with the method I posted. Is there anything I
can read? Reply off list if you want.
I'll post on-list until someone complains, just in case there are
interested onlookers :)
There aren't docs, but the code is here:
https://bitbucket.org/jython/jython/src/8a8642e45433/src/org/python/core/PyU...
Here are (I think) the most relevant bits for random access -- note
that getString() returns the internal representation of the PyUnicode
which is a java.lang.String
@Override
    protected PyObject pyget(int i) {
        if (isBasicPlane()) {
            return Py.makeCharacter(getString().charAt(i), true);
        }
This is O(1)
...
...
int k = 0;
        while (i>  0) {
            int W1 = getString().charAt(k);
            if (W1>= 0xD800&&  W1<  0xDC00) {
                k += 2;
            } else {
                k += 1;
            }
            i--;
This is an O(n) linear scan.
...
...
}
        int codepoint = getString().codePointAt(k);
        return Py.makeCharacter(codepoint, true);
    }
Near the beginning of this thread, I described and gave a link to my 
O(logk) algorithm, where k is the number of supplementary ('astral') 
chars. It uses bisect.bisect_left on an int array of length k 
constructed with a linear scan much like the one above, with one added 
line. The basic idea is to do the linear scan just once and save the 
locations (code point indexes) of the astral chars instead of repeating 
the scan on every access. That could be done as the string is 
constructed. The same array search works for slicing too. Jython is 
welcome to use it if you ever decide you need it.

I have in mind to someday do some timing tests with the Python version. 
I just do not know how closely results would be to those for compiled C 
or Java.

-- 
Terry Jan Reedy