[pypy-dev] Interest in GSoC project: UTF-8 internal unicode storage

Fri Mar 4 19:48:41 EST 2016

Hi PyPy devs,

my name is Piotr Jurkiewicz and I am a first-year PhD student at
the AGH University of Science and Technology, Kraków, Poland.

I am writing this email to make sure that PyPy is going to
participate in GSoC 2016, since I am interested in one of the
proposed projects: Optimized Unicode Representation

Below is a list of my ideas and plan for the project.

(I use Python 2 nomenclature, that is unicode strings are
`unicode` objects and bytes strings are `str` objects.)

1. Store all unicode objects contents internally as UTF-8.

     This would reduce size of stored contents and allow external
     libraries, which expect UTF-8, to process contents directly in the
     memory (for example using various regexp libraries to search unicode
     string).

2. Unify interning caches for str and unicode.

     This would allow unicode objects and corresponding
     utf8-encoded-str objects to share the same interned buffer.

     For example unicode object u'koń' would share interned buffer
     with str 'ko\xc5\x84'.

     This would make unicode.encode('utf-8') basically no op. As UTF-8
     becomes dominant encoding for any data exchange, including web (86%)
     [1], more and more data coming out from Python scripts needs to be
     UTF-8 encoded. Therefore, it is important to make this operation as
     cheap as possible.

     It would speed up str.decode('utf-8') significantly too, although it
     wouldn't make it no op. String still would need to be checked if it
     is a correct UTF-8 string when transforming to unicode object. But
     we can get rid of additional allocation, copying string contents and
     storing it twice, in CONST_STR_CACHE and CONST_UNICODE_CACHE.

3. Indexing of codepoints positions, what would allow O(1) random
    access and slicing.

     The idea is simple: alongside contents of each interned unicode
     object, store an array of unsigned integers. These integers will
     be positions (in bytes), counting from the beginning of the buffer,
     at which each next 64-codepoint-long 'pages' start.

     Random access would be as follows:

         page_num, byte_in_page = divmod(codepoint_pos, 64)
         page_start_byte = index[page_num]
         exact_byte = seek_forward(buffer[page_start_byte], byte_in_page)
         return buffer[exact_byte]

     Using 64-byte long pages, like in the example above, would allow
     O(1) random access, with constant terms of:

     - one cache access in cases of only-ASCII texts (indexes for such
       unicode objects will not be created and maintained)
     - three cache accesses in cases of texts consisting of ASCII mixed
       with two-byte characters (Latin, Greek, Cyrillic, Hebrew, Arabic
       alphabets)
     - four or five cache accesses in cases of texts consisting mostly of
       three- and four- byte characters

     (all above assuming 64-byte long CPU cache lines)

     Memory overhead associated with storing index array would be in
     range 0 - 6.25%. (or 0 - 12.5% if unicode objects longer than 2^32
     codepoints will be allowed)

     (assuming that the index array consists of integers of smallest
     possible type which can store buffer_bytes_len - 1)

4. Fast codepoints counting/seeking with branchless algorithm [2].

     When unicode object is interned, we are sure that it is a correct
     UTF-8 string. Therefore, there is no need for correctness checking
     when seeking, so a branchless algorithm can be used.

[1]: http://w3techs.com/technologies/details/en-utf8/all/all
[2]: 
http://blogs.perl.org/users/nick_wellnhofer/2015/04/branchless-utf-8-length.html

All of these changes can be introduced one at a time, what would
improve tracking of performance changes and debugging of eventual
errors.

After completing the project I plan to write a paper describing
speedup method of random access unicode access based on indexing, as
this method has a potential for being used in other language
interpreters which have immutable and/or interned unicode strings.
Note that similar index can be created for graphemes as well, so
this method can be used in languages which provide grapheme-based
interface (like Perl 6).

Please share your thoughts about these ideas.

Cheers,
Piotr