[pypy-dev] Interest in GSoC project: UTF-8 internal unicode storage

Maciej Fijalkowski fijall at gmail.com
Mon Mar 7 04:33:19 EST 2016


Hi Piotr.

Any chance to have a chat with you about the proposal on a more
real-time communication medium like IRC or GChat? (it's #pypy on IRC
and use my mail for gchat)

On Sat, Mar 5, 2016 at 2:48 AM, Piotr Jurkiewicz
<piotr.jerzy.jurkiewicz at gmail.com> wrote:
> Hi PyPy devs,
>
> my name is Piotr Jurkiewicz and I am a first-year PhD student at
> the AGH University of Science and Technology, Kraków, Poland.
>
> I am writing this email to make sure that PyPy is going to
> participate in GSoC 2016, since I am interested in one of the
> proposed projects: Optimized Unicode Representation
>
> Below is a list of my ideas and plan for the project.
>
> (I use Python 2 nomenclature, that is unicode strings are
> `unicode` objects and bytes strings are `str` objects.)
>
> 1. Store all unicode objects contents internally as UTF-8.
>
>     This would reduce the size of the stored contents and allow
>     external libraries which expect UTF-8 to process the contents
>     directly in memory (for example, using various regexp libraries
>     to search a unicode string).
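>     As a quick sanity check of the size argument, mostly-ASCII text
>     stored as UTF-8 takes far fewer bytes than a fixed
>     4-bytes-per-codepoint wide representation would (illustrative
>     comparison only, run under Python 3):

```python
# Illustrative only: compare UTF-8 size against a hypothetical
# fixed 4-bytes-per-codepoint internal representation.
s = u'ko\u0144'              # u'koń', 3 codepoints
utf8 = s.encode('utf-8')
print(len(utf8))             # 4 bytes as UTF-8
print(4 * len(s))            # 12 bytes at 4 bytes per codepoint
```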
>
> 2. Unify interning caches for str and unicode.
>
>     This would allow unicode objects and corresponding
>     utf8-encoded-str objects to share the same interned buffer.
>
>     For example unicode object u'koń' would share interned buffer
>     with str 'ko\xc5\x84'.
>
>     This would make unicode.encode('utf-8') basically a no-op. As
>     UTF-8 becomes the dominant encoding for data exchange, including
>     the web (86%) [1], more and more data produced by Python scripts
>     needs to be UTF-8 encoded. Therefore, it is important to make
>     this operation as cheap as possible.
>
>     It would speed up str.decode('utf-8') significantly too, although
>     it wouldn't become a no-op: the string would still need to be
>     validated as correct UTF-8 when converting it to a unicode
>     object. But we could get rid of the additional allocation and
>     copy of the string contents, and of storing them twice, in
>     CONST_STR_CACHE and CONST_UNICODE_CACHE.
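>     A minimal sketch (Python 3; the cache and helper names are
>     illustrative, not PyPy's actual internals) of the unified cache
>     from point 2:

```python
# Hypothetical sketch: one interning cache keyed by UTF-8 bytes, so a
# unicode object and its utf8-encoded str share the same buffer.
_intern_cache = {}  # UTF-8 bytes -> the shared interned buffer

def intern_utf8(utf8_bytes):
    """Return the single shared buffer for this UTF-8 content."""
    buf = _intern_cache.get(utf8_bytes)
    if buf is None:
        utf8_bytes.decode('utf-8')  # validate once, at interning time
        buf = _intern_cache[utf8_bytes] = utf8_bytes
    return buf
```

>     With this layout, encoding an interned unicode object to UTF-8
>     amounts to returning the buffer it already shares with the
>     corresponding str.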
>
> 3. Indexing of codepoint positions, which would allow O(1) random
>    access and slicing.
>
>     The idea is simple: alongside the contents of each interned
>     unicode object, store an array of unsigned integers. These
>     integers are byte offsets, counted from the beginning of the
>     buffer, at which each successive 64-codepoint-long 'page' starts.
>
>     Random access would be as follows:
>
>         page_num, cp_in_page = divmod(codepoint_pos, 64)
>         page_start_byte = index[page_num]
>         exact_byte = seek_forward(buffer, page_start_byte, cp_in_page)
>         return buffer[exact_byte]
>
>     Using 64-codepoint-long pages, as in the example above, would
>     allow O(1) random access, with constant terms of:
>
>     - one cache access in cases of only-ASCII texts (indexes for such
>       unicode objects will not be created and maintained)
>     - three cache accesses in cases of texts consisting of ASCII mixed
>       with two-byte characters (Latin, Greek, Cyrillic, Hebrew, Arabic
>       alphabets)
>     - four or five cache accesses in cases of texts consisting mostly of
>       three- and four- byte characters
>
>     (all above assuming 64-byte long CPU cache lines)
>
>     The memory overhead of storing the index array would be in the
>     range 0 to 6.25% (or 0 to 12.5% if unicode objects longer than
>     2^32 codepoints are allowed).
>
>     (assuming that the index array consists of integers of the
>     smallest type able to store buffer_bytes_len - 1)
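>     The whole scheme in point 3 can be sketched as follows (Python 3;
>     helper names are illustrative, not an actual PyPy
>     implementation):

```python
PAGE = 64  # codepoints per page

def is_start_byte(b):
    # every codepoint starts at a byte whose top two bits are not 10
    return (b & 0xC0) != 0x80

def build_index(buf):
    """Byte offsets at which codepoints 0, 64, 128, ... start."""
    index, cp = [], 0
    for pos, b in enumerate(buf):
        if is_start_byte(b):
            if cp % PAGE == 0:
                index.append(pos)
            cp += 1
    return index

def byte_offset(buf, index, codepoint_pos):
    # O(1): jump to the page start, then scan at most 63 codepoints
    page_num, cp_in_page = divmod(codepoint_pos, PAGE)
    pos = index[page_num]
    while cp_in_page:
        pos += 1
        if is_start_byte(buf[pos]):
            cp_in_page -= 1
    return pos
```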
>
> 4. Fast codepoints counting/seeking with branchless algorithm [2].
>
>     When a unicode object is interned, we are sure that it contains
>     correct UTF-8. Therefore, there is no need for correctness
>     checking when seeking, so a branchless algorithm can be used.
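>     The counting idea behind [2], shown in Python for clarity only
>     (the point of the branchless formulation is that the equivalent C
>     loop compiles to branch-free code): every codepoint contributes
>     exactly one non-continuation byte, so counting codepoints is a
>     plain sum of comparisons.

```python
def utf8_codepoint_count(buf):
    # a byte starts a codepoint iff its top two bits are not 10,
    # so the comparison result (0 or 1) can simply be summed
    return sum((b & 0xC0) != 0x80 for b in buf)
```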
>
> [1]: http://w3techs.com/technologies/details/en-utf8/all/all
> [2]:
> http://blogs.perl.org/users/nick_wellnhofer/2015/04/branchless-utf-8-length.html
>
> All of these changes can be introduced one at a time, which would
> make it easier to track performance changes and to debug any
> errors that appear.
>
> After completing the project I plan to write a paper describing
> this index-based method of speeding up random access into unicode
> strings, as the method has potential for use in other language
> interpreters with immutable and/or interned unicode strings. Note
> that a similar index can be built over graphemes as well, so the
> method could also be used in languages which provide a
> grapheme-based interface (like Perl 6).
>
> Please share your thoughts about these ideas.
>
> Cheers,
> Piotr
> _______________________________________________
> pypy-dev mailing list
> pypy-dev at python.org
> https://mail.python.org/mailman/listinfo/pypy-dev

