Perfect hashing for Py

Raymond Hettinger python at rcn.com
Sat Jul 12 03:13:26 EDT 2008


On Jul 11, 3:01 pm, bearophileH... at lycos.com wrote:
> I have found this perfect hash (minimal too) implementation: http://burtleburtle.net/bob/hash/perfect.html
>
> I have already translated part of it to D, and it seems to work well
> enough. As discussed in the PyConDue, I think this may be used in
> frozenset (and frozendict) to build a (minimal too?) perfect hash on
> the fly, to allow (hopefully) faster retrieval of items that don't
> change.
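
[For anyone who hasn't seen the technique, here's a deliberately naive
sketch of the idea -- nothing like Bob Jenkins' generator, just a
brute-force search for a salt; the function name and search bound are
made up for illustration:

    # Illustrative toy, not Bob Jenkins' algorithm: search for a salt
    # that makes hash((salt, key)) collision-free over a fixed key set.
    # With table_size == len(keys) the result is a *minimal* perfect
    # hash: every key gets a distinct slot and no slot is wasted.

    def find_perfect_salt(keys, table_size=None):
        """Return a salt giving a collision-free slot for every key,
        or None if the bounded search gives up."""
        if table_size is None:
            table_size = len(keys)          # minimal: one slot per key
        for salt in range(1_000_000):
            slots = {hash((salt, k)) % table_size for k in keys}
            if len(slots) == len(keys):     # all slots distinct
                return salt
        return None

    colors = frozenset(['red', 'green', 'blue', 'cyan'])
    salt = find_perfect_salt(colors)
    if salt is not None:
        for k in colors:
            print(k, '->', hash((salt, k)) % len(colors))

The real generators are far smarter about finding such a function, but
the contract is the same: a fixed key set, zero collisions.]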

A few thoughts:

* Objects currently have their own hash function that is independent
  of any given container.

* We also have to hash types other than strings (which are all that
  the examples in the links you provided use).

* While a container such as a dict or set is being built, we don't
  even know how many unique keys there are until the container is
  fully populated.

* Set-to-set and set-to-dict operations get their speed from knowing
  that both containers use the same hash values -- this would get
  disrupted if each container had its own hash function.

* Some types, like strings (especially interned strings), remember
  their own hash value -- this makes it very fast to look them up in a
  set or dict -- and that would be defeated if each container had its
  own hash function that had to be recomputed for every lookup.

* An important use case for sets is uniquification.  In that use case,
  speed is determined by insertion time and we just don't care about
  subsequent lookup time, so anything that slows down insertion (such
  as computing a perfect hash value) works to the detriment of that
  use case.

* In the current design, sets and dicts are never more than 2/3 full
  and are usually much sparser than that -- accordingly, lookups
  average between 1 and 1.5 probes (the small simulation below puts
  rough numbers on that).  That is hard to beat with perfect hashing
  since we typically get very few collisions anyway.

So you pay the cost of increased insertion time, the loss of objects
being able to remember their own hash, challenges with non-string
keys, a more complicated algorithm, and decreased interoperability for
set-to-set and set-to-dict operations -- and the only benefit is
reducing collisions to zero when they are already uncommon.
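
[To put rough numbers on the probe-count claim, here is a toy
experiment -- not CPython's real dict/set code, just an open-addressed
table with a probe sequence loosely patterned on CPython's perturb
scheme, built at about one-third full to reflect the "usually much
sparser" point:

    # Rough model only -- not the actual implementation.  Build an
    # open-addressed table well under 2/3 full and measure the average
    # number of probes per successful lookup.

    import random

    def insert(table, mask, key):
        i, perturb = hash(key) & mask, hash(key)
        while table[i] is not None:
            perturb >>= 5
            i = (5 * i + perturb + 1) & mask
        table[i] = key

    def lookup_probes(table, mask, key):
        i, perturb = hash(key) & mask, hash(key)
        probes = 1
        while table[i] != key:          # walk the same probe sequence
            perturb >>= 5
            i = (5 * i + perturb + 1) & mask
            probes += 1
        return probes

    n_keys = 10_000
    size = 1
    while size < 2 * n_keys:    # power-of-two table, roughly 1/3 full,
        size *= 2               # comfortably under the 2/3 limit
    mask = size - 1
    table = [None] * size
    keys = random.sample(range(10**9), n_keys)
    for k in keys:
        insert(table, mask, k)

    avg = sum(lookup_probes(table, mask, k) for k in keys) / n_keys
    print(f'average probes per lookup: {avg:.2f}')

On a typical run this lands in the neighborhood of 1.1 to 1.3 probes,
consistent with the 1-to-1.5 figure above.]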

For frozensets, it may be possible to add an optimize() method that
takes a completely built frozenset and optimizes its insertion order
and/or increases its size to make it arbitrarily sparse.  That would
only be useful when the cost of optimizing the table is fully repaid
by a large number of subsequent lookups.  There are some use cases
where it would pay to accept the table-optimization cost in order to
win decreased lookup costs, but there are plenty of use cases where
that would not be a winning trade-off.  So, the optimize() step would
need to be optional, not built in.
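
[Purely hypothetically -- no such method exists on frozenset -- the
trade-off could be modeled like this, reusing the same toy probing
scheme as above: rebuilding a finished key set into progressively
sparser tables pays a one-time cost to push the average probe count
toward 1.0:

    # Hypothetical model -- frozenset has no optimize() method.
    # Rebuild the same key set into sparser and sparser tables and
    # watch the average probe count fall toward 1.0, at the price of
    # memory and a one-time rebuild cost.

    import random

    def rebuild(keys, size):
        """Open-addressed table; returns average probes per insertion,
        which approximates later lookup cost under this toy scheme."""
        mask = size - 1
        table = [None] * size
        total = 0
        for key in keys:
            i, perturb = hash(key) & mask, hash(key)
            total += 1
            while table[i] is not None:
                perturb >>= 5
                i = (5 * i + perturb + 1) & mask
                total += 1
            table[i] = key
        return total / len(keys)

    keys = random.sample(range(10**9), 10_000)   # stand-in contents
    for factor in (2, 4, 8, 16):                 # ever-sparser tables
        size = 1
        while size < factor * len(keys):
            size *= 2
        print(f'size={size:7d}  avg probes={rebuild(keys, size):.3f}')

The falling probe counts show the benefit; the growing table sizes
show the memory cost, which is part of why it would have to be
opt-in.]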


Raymond

FWIW, I mentioned all this at PyConDue but the message must have
gotten lost.



