[Python-Dev] Simple dicts

Tim Peters tim.one@comcast.net
Wed, 14 May 2003 23:08:18 -0400


Behind the scenes, Damien Morton has been playing with radically different
designs for "small" dicts.  This reminded me that, in previous lives, I
always used chaining ("linked lists") for collision resolution in hash
tables.  I don't have time to pursue this now, but it would make a nice
experiment for someone who wants to try a non-trivial but still-simple and
interesting project.  It may even pay off, but that shouldn't be a
motivation <wink>.

Design notes:

PyDictEntry would grow a new

	PyDictEntry *next;

slot, boosting it from 12 to 16 bytes on a 32-bit box.  This wasn't
reasonable when Python was designed, but pymalloc allocates 16-byte chunks
with virtually no wasted space.
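
For concreteness, the entry struct would then look something like this
(assuming the current three-field layout, with the proposed link added):

typedef struct _dictentry {
	long me_hash;			/* cached hash code of me_key */
	PyObject *me_key;
	PyObject *me_value;
	struct _dictentry *next;	/* next entry on this hash chain */
} PyDictEntry;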

PyDictObject would lose the ma_fill, ma_table and ma_smalltable members,
and gain a pointer to a variable-size vector of pointers

	PyDictEntry **first;

For a hash table with 2**n slots, this vector holds 2**n pointers, memset to
NULL initially.  The hash chain for an object with hash code h starts at
first[h & ma_mask], and is linked together via PyDictEntry.next pointers.
Hash chains per hash code are independent.  There's no use for the "dummy"
state.  There's no *logical* need for the "unused" state either, although a
micro-optimizer may want to retain that in some form.
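
Insertion is then just pushing onto the front of a chain.  A minimal
sketch, using the names above (the caller has already filled in the new
entry's key, value and hash fields):

static void
insert_chained(dictobject *mp, dictentry *ep, long hash)
{
	dictentry **bucket = &mp->first[hash & mp->ma_mask];

	ep->next = *bucket;	/* old chain head, possibly NULL */
	*bucket = ep;		/* the new entry becomes the head */
}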

Memory use isn't nearly as bad as it may first appear -- to the contrary,
it's probably better on average!  Assuming a 32-bit box:

Current:  tables are normally 1/3 to 2/3 full.  If there are N active
objects in the table, at 1/3 full the table contains 3*N PyDictEntries, and
at 2/3 full it contains 1.5*N PyDictEntries, for a total of (multiplying by
12 bytes per PyDictEntry) 18*N to 36*N bytes.

Chaining:  assuming tables are still 1/3 to 2/3 full.  At 1/3 full there are
3*N first pointers and at 2/3 full there are 1.5*N first pointers, for a
total of 6*N to 12*N bytes for first pointers.  Independent of load factor,
16*N bytes are consumed by the larger PyDictEntry structs.  Adding, that's
22*N to 28*N bytes.  This relies on pymalloc's tiny wastage when allocating
16-byte chunks (under 1%).  At 2/3 full the chained scheme uses more memory
than the current one (22*N vs 18*N bytes); at 1/3 full it uses less (28*N
vs 36*N).  The average is probably better.

Note that "full" is a misnomer here.  A chained table with 2**i slots can
actually hold any number of objects, even if i==0; on average, each hash
chain contains N/2**i PyDictEntry structs.

Note that a small load factor is less important with chained resolution than
with open addressing, because collisions at different hash codes can't
interfere with each other (IOW, an object in slot #i never slows access to
an object in the slot #j collision list, whenever i != j; "breathing room"
to ease cross-hashcode collision pressure isn't needed; primary collisions
are all that exist).

Collision resolution code:  Just a list walk.  For example, lookdict_string
could be, in its entirety:

static dictentry *
lookdict_string(dictobject *mp, PyObject *key, register long hash)
{
	dictentry *p = mp->first[hash & mp->ma_mask];

	if (PyString_CheckExact(key)) {
		/* Walk this slot's chain until the key is found or the
		   chain runs out. */
		for (; p != NULL; p = p->next) {
			if (p->me_key == key ||
			    (p->me_hash == hash &&
			     PyString_Eq(p->me_key, key)))
				return p;
		}
		return NULL;
	}

	/* The key isn't an exact string:  fall back to the general
	   lookup routine. */
	mp->ma_lookup = lookdict;
	return lookdict(mp, key, hash);
}

Resizing:  Probably much faster.  The vector of first pointers can be
realloc'ed, and will sometimes benefit from the platform malloc extending
it in-place.  No other memory allocation operation is needed on a resize.
Instead, about half the PyDictEntry structs will need to move to "the
other half" of the table (the structs themselves don't get copied or
moved; they just get relinked via their next pointers).
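
Here's a sketch of the doubling case (chained_resize is an illustrative
name; shrinking and the usual resize heuristics are omitted):

static int
chained_resize(dictobject *mp)
{
	int oldsize = mp->ma_mask + 1;
	int newsize = 2 * oldsize;
	dictentry **first;
	dictentry *p, *next;
	int i, j;

	first = (dictentry **)PyMem_Realloc(mp->first,
					    newsize * sizeof(dictentry *));
	if (first == NULL)
		return -1;
	/* NULL out the new upper half of the vector. */
	memset(first + oldsize, 0, oldsize * sizeof(dictentry *));
	mp->first = first;
	mp->ma_mask = newsize - 1;

	/* Relink:  an entry in old slot i lands in new slot i or
	   i + oldsize, depending on one more bit of its hash code. */
	for (i = 0; i < oldsize; i++) {
		p = first[i];
		first[i] = NULL;
		for (; p != NULL; p = next) {
			next = p->next;
			j = (int)(p->me_hash & mp->ma_mask);
			p->next = first[j];
			first[j] = p;
		}
	}
	return 0;
}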

Copying:  Probably slower, due to needing a PyObject_Malloc() call for each
key/value pair.

Building a dict up:  Probably slower, again due to repeated
PyObject_Malloc() calls.

Referencing a dict:  Probably a wash, although because the code can be so
much simpler compilers may do a better job of optimizing it, and no tests
are needed to distinguish among three kinds of states.  Out-of-cache dicts
are killers either way.  Also see next point.

Optimizations:  The cool algorithmic thing about chaining is that
self-organizing gimmicks (like swap-toward-front or move-to-front on
reference) are easy to code and run fast (again, the dictentry structs
don't move; you just need to fiddle a few next pointers).  When collision
chains
can collide, dynamic table reorganization is so complicated and expensive
that nobody has even thought about trying it in Python.  When they can't
collide, it's simple.  Note too that since the memory burden per unused slot
falls from 12 to 4 bytes, sparser tables are less painful to contemplate.
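
For instance, a move-to-front variant of the string lookup above needs
only a trailing pointer and three pointer fiddles on a hit (an
illustrative sketch, assuming exact-string keys as in lookdict_string):

static dictentry *
lookdict_string_mtf(dictobject *mp, PyObject *key, long hash)
{
	dictentry **bucket = &mp->first[hash & mp->ma_mask];
	dictentry *prev = NULL;
	dictentry *p;

	for (p = *bucket; p != NULL; prev = p, p = p->next) {
		if (p->me_key == key ||
		    (p->me_hash == hash &&
		     PyString_Eq(p->me_key, key))) {
			if (prev != NULL) {
				/* Unlink p, then relink it at the front. */
				prev->next = p->next;
				p->next = *bucket;
				*bucket = p;
			}
			return p;
		}
	}
	return NULL;
}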

Small dicts:  There's no gimmick here to favor them.