
On Fri, Sep 9, 2011 at 10:16 AM, Terry Reedy <tjreedy@udel.edu> wrote:
> I am curious how you index by code point rather than code unit with 16-bit code units and how it compares with the method I posted. Is there anything I can read? Reply off list if you want.

I'll post on-list until someone complains, just in case there are interested onlookers :)
There aren't docs, but the code is here:

https://bitbucket.org/jython/jython/src/8a8642e45433/src/org/python/core/PyU...

Here are (I think) the most relevant bits for random access. Note that getString() returns the internal representation of the PyUnicode, which is a java.lang.String:

    @Override
    protected PyObject pyget(int i) {
        if (isBasicPlane()) {
            return Py.makeCharacter(getString().charAt(i), true);
        }

        int k = 0;
        while (i > 0) {
            int W1 = getString().charAt(k);
            if (W1 >= 0xD800 && W1 < 0xDC00) {
                k += 2;
            } else {
                k += 1;
            }
            i--;
        }
        int codepoint = getString().codePointAt(k);
        return Py.makeCharacter(codepoint, true);
    }

    public boolean isBasicPlane() {
        if (plane == Plane.BASIC) {
            return true;
        } else if (plane == Plane.UNKNOWN) {
            plane = (getString().length() == getCodePointCount()) ? Plane.BASIC : Plane.ASTRAL;
        }
        return plane == Plane.BASIC;
    }

    public int getCodePointCount() {
        if (codePointCount >= 0) {
            return codePointCount;
        }
        codePointCount = getString().codePointCount(0, getString().length());
        return codePointCount;
    }

-Frank
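
For onlookers who want to try the idea outside Jython, here is a minimal standalone sketch of the same two-path lookup on a plain java.lang.String. It is not the Jython code: the class name CodePointIndex and the method codePointAtIndex are hypothetical, it swaps the hand-written surrogate loop for the JDK's String.offsetByCodePoints, and it recomputes the code point count on every call instead of caching it the way PyUnicode does.

```java
final class CodePointIndex {
    private CodePointIndex() {}

    /**
     * Returns the code point at code-point index i (not UTF-16 code-unit index).
     * Hypothetical helper for illustration only.
     */
    public static int codePointAtIndex(String s, int i) {
        // Fast path: when the string contains no surrogate pairs, code units
        // and code points line up one-to-one, so charAt is O(1).
        if (s.length() == s.codePointCount(0, s.length())) {
            return s.charAt(i);
        }
        // Slow path: walk i code points forward from offset 0 (O(i)),
        // then decode the code point that starts at that code-unit offset.
        int k = s.offsetByCodePoints(0, i);
        return s.codePointAt(k);
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE00b"; // 'a', U+1F600 (one surrogate pair), 'b'
        System.out.println(Integer.toHexString(codePointAtIndex(s, 1))); // 1f600
        System.out.println((char) codePointAtIndex(s, 2));               // b
    }
}
```

The fast path is the reason the BASIC/ASTRAL flag is worth caching in the real class: for strings without astral characters, indexing stays constant-time, and only strings with surrogate pairs pay for the linear scan.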