[Python-Dev] Dictionary tuning

Raymond Hettinger python@rcn.com
Wed, 30 Apr 2003 11:06:51 -0400


> Jeremy had a possibly happy idea wrt this:  make the collision probe
> sequence start in a slot adjacent to the colliding slot.  That's likely to
> get sucked into cache "for free", tagging along with the slot that collided.
> If that's effective, it could buy much of the "large dict" speed gains
> Raymond saw without increasing the dict size.

I worked on similar approaches last month and found them wanting.
The concept was that a 64-byte cache line holds about 5.3 dict
entries and that probing those is much less expensive than making
a random probe into memory outside of the cache.
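
For concreteness, here is where the 5.3 figure comes from: a
back-of-the-envelope sketch assuming a 32-bit build, where a dict
entry is (me_hash, me_key, me_value), i.e. one long plus two object
pointers (the byte sizes below are assumptions, not measurements):

    # Assumed sizes for a 32-bit build: 4-byte long, 4-byte pointers.
    ENTRY_BYTES = 4 + 4 + 4        # me_hash + me_key + me_value
    CACHE_LINE_BYTES = 64

    print(CACHE_LINE_BYTES / ENTRY_BYTES)   # ~5.33 entries per line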

The first thing I learned was that the random probes were necessary
to reduce collisions.  Checking the adjacent slot amounts to a single
step of linear probing, and it increases the number of collisions.

That would be fine if the cost were offset by decreased memory
access time; however, for small dicts, the whole dict is already
in cache and having more collisions degrades performance
with no compensating gain.

The next bright idea was to have separate lookup functions for
small dicts and for large dicts.  I set the large-dict lookup
to search adjacent entries first.  The good news is that an
artificial test of big dicts showed a substantial improvement
(around 25%).  The bad news is that real programs were worse off
than before.

A day of investigation showed the cause.  The artificial test
accessed keys randomly and showed the anticipated benefit.  However,
real programs access some keys more frequently than others
(I believe Zipf's law applies).  Those keys *and* their collision
chains are likely already in the cache.  So, big dicts had
the same limitation as small dicts:  you always lose when you
accept more collisions in return for exploiting cache locality.
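
To see what that access pattern looks like, here is a small,
self-contained simulation; the 1/rank weighting and the key names
are illustrative assumptions, not measurements from real programs:

    import random
    from collections import Counter

    # Zipf-like workload: the k-th most popular key is accessed
    # with weight proportional to 1/k.
    keys = ["key%d" % k for k in range(1, 1001)]
    weights = [1.0 / k for k in range(1, 1001)]
    accesses = random.choices(keys, weights=weights, k=100_000)

    top10 = sum(n for _, n in Counter(accesses).most_common(10))
    print("top 10 of 1000 keys get %.0f%% of accesses"
          % (100.0 * top10 / len(accesses)))

A handful of hot keys dominate, so their slots and collision chains
stay cache-resident no matter what the probe sequence does.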

The conclusion was clear: the best way to gain performance
was to have fewer collisions in the first place.  Hence, I
resumed experiments on sparsification.


> 
> If someone wants to experiment with that in lookdict_string(), stick a new
> 
>     ++i;
> 
> before the for loop, and move the existing
> 
> i = (i << 2) + i + perturb + 1;
> 
> to the bottom of that loop.  Likewise for lookdict().
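
For anyone who wants to play with the probe order without touching
the C, here is a minimal Python model of the two sequences.  The
PERTURB_SHIFT constant and the recurrence mirror dictobject.c; the
generator itself is just a sketch:

    from itertools import islice

    PERTURB_SHIFT = 5              # same constant dictobject.c uses

    def probes(h, mask, adjacent_first=False):
        """Yield slot indices in the order lookdict() would probe them."""
        perturb = h & 0xFFFFFFFF   # the C code treats the hash as unsigned
        i = h & mask
        yield i
        if adjacent_first:
            i += 1                 # the suggested ++i before the loop
            yield i & mask
        while True:                # recurrence moved to the loop bottom
            i = (i << 2) + i + perturb + 1
            perturb >>= PERTURB_SHIFT
            yield i & mask

    print(list(islice(probes(123456789, 7), 4)))
    print(list(islice(probes(123456789, 7, adjacent_first=True), 4)))

With the flag set, the second probe is the adjacent slot and the
recurrence then continues from there.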

PyStone gains 1%.
PyBench loses 1%.
timecell gains 2%    (spreadsheet benchmark)
timemat loses 2%     (pure Python matrix package benchmark)
timepuzzle loses 1%  (class-based graph traverser)



Raymond Hettinger



P.S.  There is one other way to improve cache behavior, but it
involves touching code throughout dictobject.c.  Move the entry
values into a separate array from the key/hash pairs.  That way,
you get 8 entries per cache line.
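
A rough Python model of that split layout, just to show which array
gets touched where (the class and method names are hypothetical, and
it does no resizing, so it is a sketch rather than an implementation):

    class SplitTable:
        """Sketch: values stored apart from the probed hash/key pairs."""

        def __init__(self, size=8):
            self.mask = size - 1
            self.hash_key = [None] * size  # (hash, key): only array probed
            self.values = [None] * size    # touched once, only on a hit

        def _slot(self, key):
            h = hash(key)
            perturb = h & 0xFFFFFFFF   # treat the hash as unsigned
            i = h & self.mask
            while True:                # sketch only: table must not fill up
                entry = self.hash_key[i & self.mask]
                if entry is None or entry == (h, key):
                    return i & self.mask, h
                i = (i << 2) + i + perturb + 1
                perturb >>= 5

        def store(self, key, value):
            j, h = self._slot(key)
            self.hash_key[j] = (h, key)
            self.values[j] = value

        def lookup(self, key):
            j, _ = self._slot(key)
            if self.hash_key[j] is None:
                raise KeyError(key)
            return self.values[j]

Probing walks only the 8-byte (hash, key) pairs, so eight of them
fit in one 64-byte line; the value is fetched only after a match.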

P.P.S.  One other idea is to use a different search pattern
for small dictionaries.  Store entries in a self-organizing list
with no holes.  Dummy fields aren't needed which saves
a test in the linear search loop.  When an entry is found,
move it one closer to the head of the list so that the most
common entries get found instantly.  Since there are no
holes, all eight cells can be used instead of the current
maximum of five.  Like the current arrangement, the
whole small dict fits into just two cache lines.
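
A rough sketch of that scheme in Python (the class name, the
lookup/store methods, and the overflow behavior are illustrative;
the dense list, the transpose-on-hit, and the absence of dummy
entries are the point):

    class SmallDict:
        """Sketch: up to 8 entries in a dense, self-organizing list."""

        MAXSIZE = 8

        def __init__(self):
            self.entries = []   # (hash, key, value); no holes, no dummies

        def lookup(self, key):
            h = hash(key)
            e = self.entries
            for i, (eh, ek, ev) in enumerate(e):
                if eh == h and (ek is key or ek == key):
                    if i:       # transpose: move the hit one slot forward
                        e[i - 1], e[i] = e[i], e[i - 1]
                    return ev
            raise KeyError(key)

        def store(self, key, value):
            h = hash(key)
            e = self.entries
            for i, (eh, ek, _) in enumerate(e):
                if eh == h and (ek is key or ek == key):
                    e[i] = (eh, ek, value)
                    return
            if len(e) >= self.MAXSIZE:
                raise OverflowError("time to switch to the sparse form")
            e.append((h, key, value))

Deletion would simply remove the tuple and close the gap, which is
why no dummy entries (and no dummy test in the search loop) are
needed.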
