
On 9/9/2011 5:21 PM, Guido van Rossum wrote:
I, for one, am very interested. It sounds like the 'unicode' datatype in Jython does not in fact have O(1) indexing characteristics if the string contains any characters in the astral plane. Interesting. I wonder if you have heard from anyone about this affecting their app's performance?
--Guido
The question is whether or how often any Jython users are yet indexing/slicing long strings with astral chars. If a utf-8 xml file is directly parsed into a DOM, then the longest decoded strings will be 'paragraphs' that are seldom more than 1000 chars.
On Fri, Sep 9, 2011 at 12:58 PM, fwierzbicki@gmail.com <fwierzbicki@gmail.com> wrote:
On Fri, Sep 9, 2011 at 10:16 AM, Terry Reedy<tjreedy@udel.edu> wrote:
I am curious how you index by code point rather than code unit with 16-bit code units and how it compares with the method I posted. Is there anything I can read? Reply off list if you want. I'll post on-list until someone complains, just in case there are interested onlookers :)
There aren't docs, but the code is here: https://bitbucket.org/jython/jython/src/8a8642e45433/src/org/python/core/PyU...
Here are (I think) the most relevant bits for random access -- note that getString() returns the internal representation of the PyUnicode which is a java.lang.String
@Override protected PyObject pyget(int i) { if (isBasicPlane()) { return Py.makeCharacter(getString().charAt(i), true); }
This is O(1)
int k = 0; while (i> 0) { int W1 = getString().charAt(k); if (W1>= 0xD800&& W1< 0xDC00) { k += 2; } else { k += 1; } i--;
This is an O(n) linear scan.
} int codepoint = getString().codePointAt(k); return Py.makeCharacter(codepoint, true); }
Near the beginning of this thread, I described and gave a link to my O(logk) algorithm, where k is the number of supplementary ('astral') chars. It uses bisect.bisect_left on an int array of length k constructed with a linear scan much like the one above, with one added line. The basic idea is to do the linear scan just once and save the locations (code point indexes) of the astral chars instead of repeating the scan on every access. That could be done as the string is constructed. The same array search works for slicing too. Jython is welcome to use it if you ever decide you need it. I have in mind to someday do some timing tests with the Python version. I just do not know how closely results would be to those for compiled C or Java. -- Terry Jan Reedy