[pypy-dev] Interest in GSoC project: UTF-8 internal unicode storage
Maciej Fijalkowski
fijall at gmail.com
Mon Mar 7 04:33:19 EST 2016
Hi Piotr.
Any chance to have a chat with you about the proposal on a more
real-time communication medium like IRC or GChat? (it's #pypy on IRC
and use my mail for gchat)
On Sat, Mar 5, 2016 at 2:48 AM, Piotr Jurkiewicz
<piotr.jerzy.jurkiewicz at gmail.com> wrote:
> Hi PyPy devs,
>
> my name is Piotr Jurkiewicz and I am a first-year PhD student at
> the AGH University of Science and Technology, Kraków, Poland.
>
> I am writing this email to make sure that PyPy is going to
> participate in GSoC 2016, since I am interested in one of the
> proposed projects: Optimized Unicode Representation.
>
> Below is a list of my ideas and plan for the project.
>
> (I use Python 2 nomenclature, that is, unicode strings are
> `unicode` objects and byte strings are `str` objects.)
>
> 1. Store all unicode objects contents internally as UTF-8.
>
> This would reduce the size of the stored contents and allow external
> libraries that expect UTF-8 to process the contents directly in
> memory (for example, using various regexp libraries to search a
> unicode string).
>
> 2. Unify interning caches for str and unicode.
>
> This would allow unicode objects and their corresponding
> UTF-8-encoded str objects to share the same interned buffer.
>
> For example unicode object u'koń' would share interned buffer
> with str 'ko\xc5\x84'.
>
> This would make unicode.encode('utf-8') essentially a no-op. As
> UTF-8 becomes the dominant encoding for any data exchange, including
> the web (86%) [1], more and more data produced by Python scripts
> needs to be UTF-8 encoded. Therefore, it is important to make this
> operation as cheap as possible.
>
> It would speed up str.decode('utf-8') significantly too, although it
> would not make it a no-op: the string would still need to be
> validated as correct UTF-8 when converting to a unicode object. But
> we could get rid of the additional allocation, the copying of string
> contents, and storing them twice, in CONST_STR_CACHE and
> CONST_UNICODE_CACHE.
>
> 3. Indexing of codepoint positions, which would allow O(1) random
> access and slicing.
>
> The idea is simple: alongside the contents of each interned unicode
> object, store an array of unsigned integers. These integers are
> byte positions, counted from the beginning of the buffer, at which
> each successive 64-codepoint 'page' starts.
>
> Random access would be as follows:
>
> page_num, codepoint_in_page = divmod(codepoint_pos, 64)
> page_start_byte = index[page_num]
> exact_byte = seek_forward(buffer, page_start_byte, codepoint_in_page)
> return buffer[exact_byte]
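> The scheme can be turned into a runnable sketch in plain Python
> (page size of 64 codepoints as proposed; the helper names
> `build_index`, `char_len` and `codepoint_at` are mine, not an
> existing API):

```python
# Runnable sketch of the proposed codepoint index.  build_index()
# records the byte offset at which each 64-codepoint page starts;
# codepoint_at() then scans forward at most 63 codepoints from the
# nearest page boundary.
PAGE = 64  # codepoints per page, as proposed above

def build_index(buf):
    """buf: a valid UTF-8 bytes object.  Return page start offsets.
    (In the proposal, ASCII-only strings would skip the index.)"""
    index = []
    cp = 0
    for pos, byte in enumerate(buf):
        if (byte & 0xC0) != 0x80:   # non-continuation byte starts a codepoint
            if cp % PAGE == 0:
                index.append(pos)
            cp += 1
    return index

def char_len(lead):
    """Byte length of the codepoint whose lead byte is `lead`."""
    if lead < 0x80:
        return 1
    if lead < 0xE0:
        return 2
    if lead < 0xF0:
        return 3
    return 4

def codepoint_at(buf, index, i):
    """Return codepoint number `i` of `buf` as a one-codepoint str."""
    page_num, codepoint_in_page = divmod(i, PAGE)
    pos = index[page_num]
    for _ in range(codepoint_in_page):  # seek at most PAGE - 1 codepoints
        pos += char_len(buf[pos])
    return buf[pos:pos + char_len(buf[pos])].decode('utf-8')
```

> The inner loop touches at most one page of the buffer, which is what
> bounds the constant term by a few cache-line accesses.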
>
> Using 64-codepoint pages, as in the example above, would allow
> O(1) random access, with constant terms of:
>
> - one cache access in the case of ASCII-only texts (indexes for such
> unicode objects would not be created or maintained)
> - three cache accesses in cases of texts consisting of ASCII mixed
> with two-byte characters (Latin, Greek, Cyrillic, Hebrew, Arabic
> alphabets)
> - four or five cache accesses in cases of texts consisting mostly of
> three- and four-byte characters
>
> (all above assuming 64-byte long CPU cache lines)
>
> The memory overhead of storing the index array would be in the
> range 0-6.25% (or 0-12.5% if unicode objects longer than 2^32
> codepoints are allowed), assuming the index array consists of
> integers of the smallest type that can store buffer_bytes_len - 1.
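> A quick arithmetic check of these bounds, under the stated
> assumptions (one index entry per 64-codepoint page, and the worst
> case of 1 byte per codepoint, so the page is as small as possible
> relative to its index entry):

```python
# Worst-case index overhead: one entry per 64-codepoint page, with
# every codepoint occupying the minimum of 1 byte.  Entry widths of
# 4 bytes (buffers < 2^32 bytes) and 8 bytes give the two bounds.
PAGE = 64
overhead_32bit = 4 / (PAGE * 1)  # 4-byte entries: 6.25%
overhead_64bit = 8 / (PAGE * 1)  # 8-byte entries: 12.5%
```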
>
> 4. Fast codepoints counting/seeking with branchless algorithm [2].
>
> When a unicode object is interned, we are sure that it is a valid
> UTF-8 string. Therefore, there is no need for correctness checking
> when seeking, and a branchless algorithm can be used.
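> A sketch of the idea from [2] (the linked post works in C; this is
> a Python transliteration): every UTF-8 codepoint contributes exactly
> one byte that is not a continuation byte (0b10xxxxxx), so codepoints
> can be counted without any per-character branching on length.

```python
def utf8_codepoint_count(buf):
    """Count codepoints in an already-validated UTF-8 bytes object.

    Each codepoint has exactly one non-continuation byte, so summing
    the predicate (b & 0xC0) != 0x80 over all bytes gives the length
    without branching on individual character lengths.
    """
    return sum((b & 0xC0) != 0x80 for b in buf)
```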
>
> [1]: http://w3techs.com/technologies/details/en-utf8/all/all
> [2]:
> http://blogs.perl.org/users/nick_wellnhofer/2015/04/branchless-utf-8-length.html
>
> All of these changes can be introduced one at a time, which would
> make it easier to track performance changes and to debug any
> errors.
>
> After completing the project I plan to write a paper describing this
> index-based method of speeding up random access into unicode
> strings, as the method has potential for use in other language
> interpreters that have immutable and/or interned unicode strings.
> Note that a similar index can be created for graphemes as well, so
> the method could also be used in languages that provide a
> grapheme-based interface (like Perl 6).
>
> Please share your thoughts about these ideas.
>
> Cheers,
> Piotr
> _______________________________________________
> pypy-dev mailing list
> pypy-dev at python.org
> https://mail.python.org/mailman/listinfo/pypy-dev