[pypy-dev] Interest in GSoC project: UTF-8 internal unicode storage
arigo at tunes.org
Sat Mar 5 03:09:59 EST 2016
Thanks for giving some serious thoughts to the utf8-stored unicode
On 5 March 2016 at 01:48, Piotr Jurkiewicz
<piotr.jerzy.jurkiewicz at gmail.com> wrote:
> Random access would be as follows:
> page_num, byte_in_page = divmod(codepoint_pos, 64)
> page_start_byte = index[page_num]
> exact_byte = seek_forward(buffer[page_start_byte], byte_in_page)
> return buffer[exact_byte]
This is the part I'm least sure about: seek_forward() needs to be a
loop over 0 to 63 codepoints. True, each iteration can be branchless
and very short, let's say 4 instructions. But that still makes a total
of up to 252 instructions (plus the checks to know if we must go on).
These instructions are all or almost all dependent on the previous
one: you must have finished computing the length of one sequence
before you can even begin computing the length of the next one. Maybe
it's faster to use a more "XMM-izable" algorithm which counts 0 for
each byte in 0x80-0xBF (the continuation bytes) and 1 otherwise, and
sums the result.
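
To make the two approaches concrete, here is a rough Python sketch (not
PyPy code; the function names are made up for illustration, and the
buffer is assumed to be valid UTF-8 with 'start' on a lead byte). The
first function is the dependent chain: each step needs the previous
sequence length. The second looks at every byte independently, which is
the shape that could be vectorized; plain Python of course does not
vectorize it, it only shows the formulation.

```python
def seek_forward_sequential(buf, start, n):
    """Skip n codepoints by reading each lead byte's sequence length.

    Every step depends on the position computed by the previous one.
    """
    pos = start
    for _ in range(n):
        lead = buf[pos]
        if lead < 0x80:
            pos += 1      # ASCII
        elif lead < 0xE0:
            pos += 2      # 2-byte sequence
        elif lead < 0xF0:
            pos += 3      # 3-byte sequence
        else:
            pos += 4      # 4-byte sequence
    return pos


def count_codepoints(buf, start, stop):
    """Count 0 for each byte in 0x80-0xBF and 1 otherwise, then sum."""
    return sum((b & 0xC0) != 0x80 for b in buf[start:stop])


def seek_forward_counting(buf, start, n):
    """Return the byte offset of the n-th codepoint after 'start'.

    Each byte is classified on its own, with no need to know the
    length of the preceding sequence.
    """
    seen = 0
    pos = start
    while pos < len(buf):
        if (buf[pos] & 0xC0) != 0x80:   # not a continuation byte
            if seen == n:
                return pos
            seen += 1
        pos += 1
    return pos   # n codepoints or more were requested: end of buffer
```

Both seek functions return the same offsets on valid input; the
difference is only in how much each step depends on the one before it.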
There are also variants, e.g. adding a second array of words similar
to 'index', but where each word is 8 packed bytes giving 8 starting
points inside the page (each in range 0-252). This would reduce the
walk to 0-7 codepoints.
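
A sketch of that variant, again in plain Python and only as an
illustration: 'build_indexes', 'byte_offset' and 'subindex' are
hypothetical names, and the layout (slot k of each word holds the byte
offset, relative to the page start, of codepoint 8*k in the page) is
one possible reading of the idea, not a settled design. The buffer is
assumed to be valid UTF-8.

```python
PAGE = 64   # codepoints per page, as in the quoted proposal
SUB = 8     # codepoints between two packed starting points


def build_indexes(buf):
    """Build 'index' (page start bytes) and 'subindex' (packed words)."""
    index = []
    subindex = []
    cp = 0
    for pos, b in enumerate(buf):
        if (b & 0xC0) == 0x80:
            continue                      # continuation byte
        if cp % PAGE == 0:
            index.append(pos)
            subindex.append(0)
        if cp % SUB == 0:
            slot = (cp % PAGE) // SUB
            # the offset is at most 56 * 4 = 224, so it fits in one byte
            subindex[-1] |= (pos - index[-1]) << (8 * slot)
        cp += 1
    return index, subindex


def byte_offset(buf, index, subindex, codepoint_pos):
    """Byte offset of a codepoint; codepoint_pos must be in range."""
    page_num, cp_in_page = divmod(codepoint_pos, PAGE)
    slot, remaining = divmod(cp_in_page, SUB)
    pos = index[page_num] + ((subindex[page_num] >> (8 * slot)) & 0xFF)
    while remaining:                      # walks at most 7 codepoints
        pos += 1
        if (buf[pos] & 0xC0) != 0x80:
            remaining -= 1
    return pos
```

The cost is one extra word per 64 codepoints, in exchange for a walk
that is at most 7 codepoints long instead of 63.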
I'm +1 on your proposal. The whole thing is definitely worth a try.