[Python-Dev] More compact dictionaries with faster iteration
Christian Tismer
tismer at stackless.com
Wed May 15 13:32:39 CEST 2013
Hi Raymond,
On 08.01.13 15:49, Maciej Fijalkowski wrote:
> On Mon, Dec 10, 2012 at 3:44 AM, Raymond Hettinger
> <raymond.hettinger at gmail.com> wrote:
>> The current memory layout for dictionaries is
>> unnecessarily inefficient. It has a sparse table of
>> 24-byte entries containing the hash value, key pointer,
>> and value pointer.
>>
>> Instead, the 24-byte entries should be stored in a
>> dense table referenced by a sparse table of indices.
>>
>> For example, the dictionary:
>>
>> d = {'timmy': 'red', 'barry': 'green', 'guido': 'blue'}
>>
>> is currently stored as:
>>
>> entries = [['--', '--', '--'],
>> [-8522787127447073495, 'barry', 'green'],
>> ['--', '--', '--'],
>> ['--', '--', '--'],
>> ['--', '--', '--'],
>> [-9092791511155847987, 'timmy', 'red'],
>> ['--', '--', '--'],
>> [-6480567542315338377, 'guido', 'blue']]
>>
>> Instead, the data should be organized as follows:
>>
>> indices = [None, 1, None, None, None, 0, None, 2]
>> entries = [[-9092791511155847987, 'timmy', 'red'],
>> [-8522787127447073495, 'barry', 'green'],
>> [-6480567542315338377, 'guido', 'blue']]
>>
>> Only the data layout needs to change. The hash table
>> algorithms would stay the same. All of the current
>> optimizations would be kept, including key-sharing
>> dicts and custom lookup functions for string-only
>> dicts. There is no change to the hash functions, the
>> table search order, or collision statistics.
>>
>> The memory savings are significant (from 30% to 95%
>> compression depending on the how full the table is).
>> Small dicts (size 0, 1, or 2) get the most benefit.
>>
>> For a sparse table of size t with n entries, the sizes are:
>>
>> curr_size = 24 * t
>> new_size = 24 * n + sizeof(index) * t
>>
>> In the above timmy/barry/guido example, the current
>> size is 192 bytes (eight 24-byte entries) and the new
>> size is 80 bytes (three 24-byte entries plus eight
>> 1-byte indices). That gives 58% compression.
>>
>> Note, the sizeof(index) can be as small as a single
>> byte for small dicts, two bytes for bigger dicts and
>> up to sizeof(Py_ssize_t) for huge dict.
>>
>> In addition to space savings, the new memory layout
>> makes iteration faster. Currently, keys(), values, and
>> items() loop over the sparse table, skipping-over free
>> slots in the hash table. Now, keys/values/items can
>> loop directly over the dense table, using fewer memory
>> accesses.
>>
>> Another benefit is that resizing is faster and
>> touches fewer pieces of memory. Currently, every
>> hash/key/value entry is moved or copied during a
>> resize. In the new layout, only the indices are
>> updated. For the most part, the hash/key/value entries
>> never move (except for an occasional swap to fill a
>> hole left by a deletion).
>>
>> With the reduced memory footprint, we can also expect
>> better cache utilization.
>>
>> For those wanting to experiment with the design,
>> there is a pure Python proof-of-concept here:
>>
>> http://code.activestate.com/recipes/578375
>>
>> YMMV: Keep in mind that the above size statics assume a
>> build with 64-bit Py_ssize_t and 64-bit pointers. The
>> space savings percentages are a bit different on other
>> builds. Also, note that in many applications, the size
>> of the data dominates the size of the container (i.e.
>> the weight of a bucket of water is mostly the water,
>> not the bucket).
>>
>>
>> Raymond
>> _______________________________________________
>> Python-Dev mailing list
>> Python-Dev at python.org
>> http://mail.python.org/mailman/listinfo/python-dev
>> Unsubscribe: http://mail.python.org/mailman/options/python-dev/fijall%40gmail.com
> One question Raymond.
>
> The compression ratios stay true provided you don't overallocate entry
> list. If you do overallocate you don't really gain that much (it all
> depends vastly on details), or even loose in some cases. What do you
> think should the strategy be?
>
What is the current status of this discussion?
I'd like to know whether it is a considered alternative implementation.
There is also a discussion in python-ideas right now where this
alternative is mentioned, and I think especially for small dicts
as **kwargs, it could be a cheap way to introduce order.
Is this going on, somewhere? I'm quite interested on that.
cheers - chris
--
Christian Tismer :^) <mailto:tismer at stackless.com>
Software Consulting : Have a break! Take a ride on Python's
Karl-Liebknecht-Str. 121 : *Starship* http://starship.python.net/
14482 Potsdam : PGP key -> http://pgp.uni-mainz.de
phone +49 173 24 18 776 fax +49 (30) 700143-0023
PGP 0x57F3BF04 9064 F4E1 D754 C2FF 1619 305B C09C 5A3B 57F3 BF04
whom do you want to sponsor today? http://www.stackless.com/
More information about the Python-Dev
mailing list