
I, for one, am very interested. It sounds like the 'unicode' datatype in Jython does not in fact have O(1) indexing characteristics if the string contains any characters in the astral plane. Interesting. I wonder if you have heard from anyone about this affecting their app's performance?

--Guido

On Fri, Sep 9, 2011 at 12:58 PM, fwierzbicki@gmail.com <fwierzbicki@gmail.com> wrote:
On Fri, Sep 9, 2011 at 10:16 AM, Terry Reedy <tjreedy@udel.edu> wrote:
I am curious how you index by code point rather than code unit with 16-bit code units and how it compares with the method I posted. Is there anything I can read? Reply off list if you want.

I'll post on-list until someone complains, just in case there are interested onlookers :)
There aren't docs, but the code is here: https://bitbucket.org/jython/jython/src/8a8642e45433/src/org/python/core/PyU...
Here are (I think) the most relevant bits for random access -- note that getString() returns the internal representation of the PyUnicode, which is a java.lang.String:
    @Override
    protected PyObject pyget(int i) {
        if (isBasicPlane()) {
            return Py.makeCharacter(getString().charAt(i), true);
        }

        int k = 0;
        while (i > 0) {
            int W1 = getString().charAt(k);
            if (W1 >= 0xD800 && W1 < 0xDC00) {
                k += 2;
            } else {
                k += 1;
            }
            i--;
        }
        int codepoint = getString().codePointAt(k);
        return Py.makeCharacter(codepoint, true);
    }

    public boolean isBasicPlane() {
        if (plane == Plane.BASIC) {
            return true;
        } else if (plane == Plane.UNKNOWN) {
            plane = (getString().length() == getCodePointCount()) ? Plane.BASIC : Plane.ASTRAL;
        }
        return plane == Plane.BASIC;
    }

    public int getCodePointCount() {
        if (codePointCount >= 0) {
            return codePointCount;
        }
        codePointCount = getString().codePointCount(0, getString().length());
        return codePointCount;
    }
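
For onlookers without the Jython source at hand, here is a minimal standalone sketch of the same idea using only java.lang.String methods. It is not Jython code: the class name CodePointIndexSketch and its two static methods are hypothetical, made up for illustration. It also recomputes the basic-plane check on every call, whereas the code above caches it in the plane and codePointCount fields. One more assumption to be aware of: String.offsetByCodePoints counts an unpaired surrogate as one code point, while the loop above always advances by two after a high surrogate, so the two may disagree on malformed strings.

```java
public final class CodePointIndexSketch {
    private CodePointIndexSketch() {}

    /** True if s has no surrogate pairs, so char index equals code point index. */
    static boolean isBasicPlane(String s) {
        // Uncached O(n) check; the quoted Jython code caches this result.
        return s.length() == s.codePointCount(0, s.length());
    }

    /** Returns the code point at code point index i (not char index). */
    static int codePointAtIndex(String s, int i) {
        if (isBasicPlane(s)) {
            // Fast path: one 16-bit code unit per code point.
            return s.charAt(i);
        }
        // Slow path: scan from the start, stepping over surrogate pairs.
        int k = s.offsetByCodePoints(0, i);
        return s.codePointAt(k);
    }

    public static void main(String[] args) {
        // 'a', U+1F600 (encoded as the pair D83D DE00), 'b'
        String s = "a\uD83D\uDE00b";
        System.out.println("chars: " + s.length());
        System.out.println("code points: " + s.codePointCount(0, s.length()));
        for (int i = 0; i < 3; i++) {
            System.out.println(i + ": U+"
                    + Integer.toHexString(codePointAtIndex(s, i)).toUpperCase());
        }
    }
}
```

The slow path is what makes indexing O(n) for strings with astral characters: the char offset of code point i depends on how many surrogate pairs precede it, so it cannot be computed without a scan (or an auxiliary index).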
-Frank
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/guido%40python.org
-- 
--Guido van Rossum (python.org/~guido)