[pypy-dev] Interest in GSoC project: UTF-8 internal unicode storage

Armin Rigo arigo at tunes.org
Sat Mar 5 03:09:59 EST 2016

Hi Piotr,

Thanks for giving some serious thought to the utf8-stored unicode
string proposal!

On 5 March 2016 at 01:48, Piotr Jurkiewicz
<piotr.jerzy.jurkiewicz at gmail.com> wrote:
>     Random access would be as follows:
>         page_num, byte_in_page = divmod(codepoint_pos, 64)
>         page_start_byte = index[page_num]
>         exact_byte = seek_forward(buffer[page_start_byte], byte_in_page)
>         return buffer[exact_byte]

This is the part I'm least sure about: seek_forward() needs to be a
loop over 0 to 63 codepoints.  True, each iteration can be branchless,
and very short---let's say 4 instructions.  But it still makes a total
of up to 252 instructions (plus the checks to know if we must go on).
These instructions are all or almost all dependent on the previous
one: you must have finished computing the length of one sequence to
even begin computing the length of the next one.  Maybe it's faster to
use a more "XMM-izable" algorithm which counts 0 for each byte in
0x80-0xBF and 1 otherwise, and makes the sum.
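
To make the counting idea concrete, here is a rough pure-Python sketch,
not a tuned implementation.  The 16-byte CHUNK (one XMM register), the
helper count_starts() and this particular seek_forward() signature are
my assumptions for illustration; real code would do the per-chunk sum
with SIMD instructions instead of a Python loop.

```python
CHUNK = 16  # assumption: one 128-bit XMM register's worth of bytes


def count_starts(chunk: bytes) -> int:
    # Each byte in 0x80-0xBF (a continuation byte) counts 0, any other
    # byte counts 1; the sum is the number of codepoints starting here.
    return sum(1 for b in chunk if (b & 0xC0) != 0x80)


def seek_forward(buf: bytes, start: int, n: int) -> int:
    """Byte offset of the codepoint that lies n codepoints after the
    one starting at byte offset `start` (len(buf) if it is the end)."""
    pos = start
    remaining = n  # codepoint starts still to be skipped
    # Skip whole chunks as long as they do not contain the target.
    # The chunk counts are independent of each other, unlike the
    # sequence-length chain.
    while pos + CHUNK <= len(buf):
        c = count_starts(buf[pos:pos + CHUNK])
        if c > remaining:
            break
        remaining -= c
        pos += CHUNK
    # Finish with a short scan inside the last chunk.
    for i in range(pos, len(buf)):
        if (buf[i] & 0xC0) != 0x80:
            if remaining == 0:
                return i
            remaining -= 1
    return len(buf)
```

With a page of 64 codepoints this needs at most a handful of chunk
counts plus one short scan, and only the final scan has a serial
dependency.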

There are also variants, e.g. adding a second array of words similar
to 'index', but where each word is 8 packed bytes giving 8 starting
points inside the page (each in range 0-252).  This would reduce the
walk to 0-7 codepoints.
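
A minimal sketch of that variant, assuming pages of 64 codepoints and
one starting point every 8 codepoints.  The names build_indexes(),
byte_offset() and 'sub' are made up for this example, and Python ints
stand in for the packed 64-bit words.

```python
PAGE = 64   # codepoints per page, as in the quoted proposal
GROUP = 8   # assumption: one packed starting point every 8 codepoints


def build_indexes(buf: bytes):
    """Return (index, sub): index[p] is the byte offset of page p, and
    sub[p] packs 8 one-byte offsets, relative to index[p], of the
    codepoints 0, 8, ..., 56 of that page."""
    index, sub = [], []
    cp = 0  # number of codepoints seen so far
    for pos, b in enumerate(buf):
        if (b & 0xC0) == 0x80:
            continue  # continuation byte, not a codepoint start
        if cp % PAGE == 0:
            index.append(pos)
            sub.append(0)
        if cp % GROUP == 0:
            slot = (cp % PAGE) // GROUP
            # The relative offset is at most 56 * 4 = 224, so it fits
            # in one byte.
            sub[-1] |= (pos - index[-1]) << (8 * slot)
        cp += 1
    return index, sub


def byte_offset(buf: bytes, index, sub, codepoint_pos: int) -> int:
    """Byte offset of a valid codepoint position."""
    page, in_page = divmod(codepoint_pos, PAGE)
    slot, rest = divmod(in_page, GROUP)
    pos = index[page] + ((sub[page] >> (8 * slot)) & 0xFF)
    for _ in range(rest):  # walk at most 7 codepoints
        pos += 1
        while (buf[pos] & 0xC0) == 0x80:
            pos += 1
    return pos
```

The cost is one extra word per page on top of 'index'; whether that
pays off compared to the counting approach would have to be measured.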

I'm +1 on your proposal. The whole thing is definitely worth a try.

A bientôt,


