[Python-Dev] More compact dictionaries with faster iteration

Wed May 15 13:32:39 CEST 2013

Hi Raymond,

On 08.01.13 15:49, Maciej Fijalkowski wrote:
> On Mon, Dec 10, 2012 at 3:44 AM, Raymond Hettinger
> <raymond.hettinger at gmail.com> wrote:
>> The current memory layout for dictionaries is
>> unnecessarily inefficient.  It has a sparse table of
>> 24-byte entries containing the hash value, key pointer,
>> and value pointer.
>>
>> Instead, the 24-byte entries should be stored in a
>> dense table referenced by a sparse table of indices.
>>
>> For example, the dictionary:
>>
>>      d = {'timmy': 'red', 'barry': 'green', 'guido': 'blue'}
>>
>> is currently stored as:
>>
>>      entries = [['--', '--', '--'],
>>                 [-8522787127447073495, 'barry', 'green'],
>>                 ['--', '--', '--'],
>>                 ['--', '--', '--'],
>>                 ['--', '--', '--'],
>>                 [-9092791511155847987, 'timmy', 'red'],
>>                 ['--', '--', '--'],
>>                 [-6480567542315338377, 'guido', 'blue']]
>>
>> Instead, the data should be organized as follows:
>>
>>      indices =  [None, 1, None, None, None, 0, None, 2]
>>      entries =  [[-9092791511155847987, 'timmy', 'red'],
>>                  [-8522787127447073495, 'barry', 'green'],
>>                  [-6480567542315338377, 'guido', 'blue']]
>>
>> Only the data layout needs to change.  The hash table
>> algorithms would stay the same.  All of the current
>> optimizations would be kept, including key-sharing
>> dicts and custom lookup functions for string-only
>> dicts.  There is no change to the hash functions, the
>> table search order, or collision statistics.
>>
>> The memory savings are significant (from 30% to 95%
>> compression depending on the how full the table is).
>> Small dicts (size 0, 1, or 2) get the most benefit.
>>
>> For a sparse table of size t with n entries, the sizes are:
>>
>>      curr_size = 24 * t
>>      new_size = 24 * n + sizeof(index) * t
>>
>> In the above timmy/barry/guido example, the current
>> size is 192 bytes (eight 24-byte entries) and the new
>> size is 80 bytes (three 24-byte entries plus eight
>> 1-byte indices).  That gives 58% compression.
>>
>> Note, the sizeof(index) can be as small as a single
>> byte for small dicts, two bytes for bigger dicts and
>> up to sizeof(Py_ssize_t) for huge dict.
>>
>> In addition to space savings, the new memory layout
>> makes iteration faster.  Currently, keys(), values, and
>> items() loop over the sparse table, skipping-over free
>> slots in the hash table.  Now, keys/values/items can
>> loop directly over the dense table, using fewer memory
>> accesses.
>>
>> Another benefit is that resizing is faster and
>> touches fewer pieces of memory.  Currently, every
>> hash/key/value entry is moved or copied during a
>> resize.  In the new layout, only the indices are
>> updated.  For the most part, the hash/key/value entries
>> never move (except for an occasional swap to fill a
>> hole left by a deletion).
>>
>> With the reduced memory footprint, we can also expect
>> better cache utilization.
>>
>> For those wanting to experiment with the design,
>> there is a pure Python proof-of-concept here:
>>
>>     http://code.activestate.com/recipes/578375
>>
>> YMMV: Keep in mind that the above size statics assume a
>> build with 64-bit Py_ssize_t and 64-bit pointers.  The
>> space savings percentages are a bit different on other
>> builds.  Also, note that in many applications, the size
>> of the data dominates the size of the container (i.e.
>> the weight of a bucket of water is mostly the water,
>> not the bucket).
>>
>>
>> Raymond
>> _______________________________________________
>> Python-Dev mailing list
>> Python-Dev at python.org
>> http://mail.python.org/mailman/listinfo/python-dev
>> Unsubscribe: http://mail.python.org/mailman/options/python-dev/fijall%40gmail.com
> One question Raymond.
>
> The compression ratios stay true provided you don't overallocate entry
> list. If you do overallocate you don't really gain that much (it all
> depends vastly on details), or even loose in some cases. What do you
> think should the strategy be?
>

What is the current status of this discussion?
I'd like to know whether it is a considered alternative implementation.

There is also a discussion in python-ideas right now where this
alternative is mentioned, and I think especially for small dicts
as **kwargs, it could be a cheap way to introduce order.

Is this going on, somewhere? I'm quite interested on that.

cheers - chris

-- 
Christian Tismer             :^)   <mailto:tismer at stackless.com>
Software Consulting          :     Have a break! Take a ride on Python's
Karl-Liebknecht-Str. 121     :    *Starship* http://starship.python.net/
14482 Potsdam                :     PGP key -> http://pgp.uni-mainz.de
phone +49 173 24 18 776  fax +49 (30) 700143-0023
PGP 0x57F3BF04       9064 F4E1 D754 C2FF 1619  305B C09C 5A3B 57F3 BF04
       whom do you want to sponsor today?   http://www.stackless.com/