[Python-Dev] PEP 393 Summer of Code Project

Guido van Rossum guido at python.org
Fri Sep 9 23:21:33 CEST 2011


I, for one, am very interested. It sounds like the 'unicode' datatype
in Jython does not in fact have O(1) indexing characteristics if the
string contains any characters in the astral plane. Interesting. I
wonder if you have heard from anyone about this affecting their app's
performance?
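
For onlookers, here is a minimal sketch of why the lookup is linear, not
Jython's actual code. The class and method names (CodePointIndexSketch,
codePointAtIndex, isBasicPlane) are made up for illustration; only the
java.lang.String calls are real. String.offsetByCodePoints does the same
left-to-right walk over surrogate pairs as the hand-written loop quoted
below, so the cost grows with the index whenever astral characters may be
present.

```java
public final class CodePointIndexSketch {

    // Hypothetical helper: true when no surrogate pairs are present, i.e.
    // the char count equals the code point count. Only then is charAt(i)
    // a valid O(1) answer for code point index i.
    public static boolean isBasicPlane(String s) {
        return s.length() == s.codePointCount(0, s.length());
    }

    // Hypothetical helper: return the i-th code point of s.
    public static int codePointAtIndex(String s, int i) {
        if (isBasicPlane(s)) {
            return s.charAt(i); // fast path, constant time
        }
        // Slow path: step over i code points from the start. Each surrogate
        // pair advances the char offset by 2, everything else by 1.
        int k = s.offsetByCodePoints(0, i);
        return s.codePointAt(k);
    }

    public static void main(String[] args) {
        String s = "a\uD835\uDD0Ab"; // 'a', U+1D50A (one surrogate pair), 'b'
        System.out.println(s.length());                                  // 4
        System.out.println(isBasicPlane(s));                             // false
        System.out.println(Integer.toHexString(codePointAtIndex(s, 1))); // 1d50a
        System.out.println((char) codePointAtIndex(s, 2));               // b
    }
}
```

Unlike the quoted code, this sketch recomputes the plane check on every
call; caching that result, as the quoted isBasicPlane() does, is what keeps
the basic-plane path at O(1).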

--Guido

On Fri, Sep 9, 2011 at 12:58 PM, fwierzbicki at gmail.com
<fwierzbicki at gmail.com> wrote:
> On Fri, Sep 9, 2011 at 10:16 AM, Terry Reedy <tjreedy at udel.edu> wrote:
>
>> I am curious how you index by code point rather than code unit with 16-bit
>> code units and how it compares with the method I posted. Is there anything I
>> can read? Reply off list if you want.
> I'll post on-list until someone complains, just in case there are
> interested onlookers :)
>
> There aren't docs, but the code is here:
> https://bitbucket.org/jython/jython/src/8a8642e45433/src/org/python/core/PyUnicode.java
>
> Here are (I think) the most relevant bits for random access -- note
> that getString() returns the internal representation of the PyUnicode
> which is a java.lang.String
>
>    @Override
>    protected PyObject pyget(int i) {
>        if (isBasicPlane()) {
>            return Py.makeCharacter(getString().charAt(i), true);
>        }
>
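>        // Astral string: walk from the start, one code point per
>        // iteration, so the lookup costs O(i) rather than O(1).
>        int k = 0;
>        while (i > 0) {
>            int W1 = getString().charAt(k);
>            // A high (lead) surrogate starts a two-char pair.
>            if (W1 >= 0xD800 && W1 < 0xDC00) {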
>                k += 2;
>            } else {
>                k += 1;
>            }
>            i--;
>        }
>        int codepoint = getString().codePointAt(k);
>        return Py.makeCharacter(codepoint, true);
>    }
>
>    public boolean isBasicPlane() {
>        if (plane == Plane.BASIC) {
>            return true;
>        } else if (plane == Plane.UNKNOWN) {
>            plane = (getString().length() == getCodePointCount())
>                    ? Plane.BASIC : Plane.ASTRAL;
>        }
>        return plane == Plane.BASIC;
>    }
>
>    public int getCodePointCount() {
>        if (codePointCount >= 0) {
>            return codePointCount;
>        }
>        codePointCount = getString().codePointCount(0, getString().length());
>        return codePointCount;
>    }
>
> -Frank



-- 
--Guido van Rossum (python.org/~guido)

