[pypy-dev] Interest in GSoC project: UTF-8 internal unicode storage
piotr.jerzy.jurkiewicz at gmail.com
Fri Mar 4 19:48:41 EST 2016
Hi PyPy devs,
my name is Piotr Jurkiewicz and I am a first-year PhD student at
the AGH University of Science and Technology, Kraków, Poland.
I am writing this email to make sure that PyPy is going to
participate in GSoC 2016, since I am interested in one of the
proposed projects: Optimized Unicode Representation.
Below is a list of my ideas and plan for the project.
(I use Python 2 nomenclature, that is unicode strings are
`unicode` objects and bytes strings are `str` objects.)
1. Store the contents of all unicode objects internally as UTF-8.
This would reduce the size of the stored contents and allow external
libraries which expect UTF-8 to process the contents directly in
memory (for example, using various regexp libraries to search
unicode strings).
2. Unify the interning caches for str and unicode.
This would allow unicode objects and the corresponding
UTF-8-encoded str objects to share the same interned buffer.
For example, the unicode object u'koń' would share an interned
buffer with the str 'ko\xc5\x84'.
This would make unicode.encode('utf-8') basically a no-op. As UTF-8
becomes the dominant encoding for data exchange, including the web
(86%), more and more data coming out of Python scripts needs to be
UTF-8 encoded. Therefore, it is important to make this operation as
cheap as possible.
It would speed up str.decode('utf-8') significantly too, although it
wouldn't make it a no-op: the string would still need to be checked
for UTF-8 validity when transforming it into a unicode object. But
we could get rid of the additional allocation and of copying the
string contents and storing them twice, in CONST_STR_CACHE and
CONST_UNICODE_CACHE.
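A toy sketch of the idea (the name `intern_utf8` and the plain-dict cache are hypothetical, not PyPy's actual interning machinery):

```python
# Hypothetical sketch: one intern cache keyed by UTF-8 bytes, shared
# by str and unicode. encode('utf-8') then degenerates to a lookup
# that returns the already-interned buffer.
_intern_cache = {}

def intern_utf8(data):
    # data: UTF-8 encoded bytes; returns the canonical shared buffer
    return _intern_cache.setdefault(data, data)

shared_a = intern_utf8(u'ko\u0144'.encode('utf-8'))  # u'koń'
shared_b = intern_utf8(b'ko\xc5\x84')                # the same bytes
print(shared_a is shared_b)  # True: one buffer serves both objects
```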
3. Indexing of codepoint positions, which would allow O(1) random
access and slicing.
The idea is simple: alongside contents of each interned unicode
object, store an array of unsigned integers. These integers will
be positions (in bytes), counting from the beginning of the buffer,
at which each next 64-codepoint-long 'pages' start.
Random access would be as follows:
page_num, cp_in_page = divmod(codepoint_pos, 64)
page_start_byte = index[page_num]
exact_byte = seek_forward(buffer, page_start_byte, cp_in_page)
Using 64-codepoint-long pages, as in the example above, would allow
O(1) random access, with constant terms of:
- one cache access in the case of ASCII-only texts (indexes for such
unicode objects would not be created or maintained)
- three cache accesses in the case of texts consisting of ASCII
mixed with two-byte characters (Latin, Greek, Cyrillic, Hebrew,
Arabic)
- four or five cache accesses in the case of texts consisting mostly
of three- and four-byte characters
(all of the above assuming 64-byte CPU cache lines)
The memory overhead of storing the index array would be in the
range 0% to 6.25% (or 0% to 12.5% if unicode objects longer than
2^32 codepoints are allowed), assuming that the index array consists
of integers of the smallest type capable of storing
buffer_bytes_len - 1.
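A minimal Python sketch of the scheme (the names `build_index` and `byte_offset` are my own, assuming 64-codepoint pages over an already-validated UTF-8 buffer):

```python
PAGE = 64  # codepoints per page, as in the proposal

def build_index(buf):
    # Byte offset at which each PAGE-codepoint page starts. A byte
    # starts a codepoint iff it is not a continuation byte, i.e.
    # (b & 0xC0) != 0x80.
    index = []
    cp = 0
    for pos, b in enumerate(bytearray(buf)):
        if b & 0xC0 != 0x80:
            if cp % PAGE == 0:
                index.append(pos)
            cp += 1
    return index

def byte_offset(buf, index, codepoint_pos):
    # O(1): jump to the page start, then scan at most PAGE - 1
    # codepoints forward.
    page_num, cp_in_page = divmod(codepoint_pos, PAGE)
    pos = index[page_num]
    data = bytearray(buf)
    while cp_in_page:
        pos += 1
        if data[pos] & 0xC0 != 0x80:
            cp_in_page -= 1
    return pos

buf = (u'a' * 70 + u'\u015bx').encode('utf-8')  # 72 codepoints
idx = build_index(buf)
print(idx)                         # [0, 64]
print(byte_offset(buf, idx, 70))   # 70 (the two-byte '\u015b')
print(byte_offset(buf, idx, 71))   # 72 ('x' follows the 2-byte char)
```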
4. Fast codepoint counting/seeking with a branchless algorithm.
When a unicode object is interned, we are sure that it is valid
UTF-8. Therefore, there is no need for correctness checking when
seeking, so a branchless algorithm can be used.
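For illustration, the classic branchless counting idiom, sketched here in Python (in C the comparison compiles to arithmetic with no data-dependent branch):

```python
def count_codepoints(buf):
    # A codepoint starts at every byte that is not a continuation
    # byte; (b & 0xC0) != 0x80 yields 0 or 1, so the accumulation
    # needs no branching on the byte's value.
    return sum((b & 0xC0) != 0x80 for b in bytearray(buf))

print(count_codepoints(u'ko\u0144'.encode('utf-8')))  # 3
```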
All of these changes can be introduced one at a time, which would
make it easier to track performance changes and to debug any
regressions.
After completing the project I plan to write a paper describing this
indexing-based method of speeding up random access to unicode
strings, as the method has potential for use in other language
interpreters which have immutable and/or interned unicode strings.
Note that a similar index can be created for graphemes as well, so
this method could also be used in languages which provide a
grapheme-based interface (like Perl 6).
Please share your thoughts about these ideas.