[Tutor] Question about order in a dictionary [how does hashing work?]

Tue Jun 29 03:30:35 EDT 2004

> I'm comfortable with mathematics, but not a computer scientist by any
> means. I've been wanting to learn about the details of hashtables
> (meaning of the term, how implemented, why allowing for faster search,
> etc.). Does anyone know of a good presentation that gets the details
> covered correctly, but doesn't presuppose that I am at least halfway
> through a comp sci BSc?

Hi Brian,

I wrote an introduction to hashing a long long time ago:

    http://mail.python.org/pipermail/tutor/2002-January/011281.html

But I guess it can't hurt to rehash hashing.  *grin*

The core idea is that list lookup is really really fast as long as we're
dealing with numeric indices.

Conceptually, a Python list is a bunch of key-value pairs.  When we see:

    ["zero", "one", "two", "three", "four", "five"]

we might imagine that this list represents the same kind of information as
the following dictionary:

    { 0 : "zero",
      1 : "one",
      2 : "two",
      3 : "three",
      4 : "four",
      5 : "five" }

where the "key" part is implied by the index of our list.

So lists are excellent for key/value lookup, when our keys are small
numbers.

Let's make the problem concrete.  The list:

###
>>> words = ["zero", "one", "two", "three", "four", "five"]
###

makes it really easy for us to go from a single digit to its English
transliteration:

###
>>> words[0]
'zero'
>>> words[4]
'four'
###

But if our keys aren't numbers, what can we do? What if we want to go the
other way?  That is, given a word like 'five', is it easy to get back the
number 5?  Yes, if we're willing to do a search across our list:

###
>>> words.index('five')
5
###

The only problem with this approach is that it's doing a linear scan
across the list.  We can do better than that.

Hashing is an approach that uses a "hash" function.  This "hash" function
transforms a key directly into an integer value, which we then use as an
index into some mangled-up list.

Python comes with a hash builtin()  already:

###
>>> words = ["zero", "one", "two", "three", "four", "five"]
>>> for w in words:
...     print w, hash(w)
...
zero -1293135726
one -261223665
two 323309869
three 1505609005
four -1250885616
five 202874452
###

But these are wild looking numbers.  *grin*

Let's make them look a little nicer by using the remainder "modulo"
operator:

###
>>> for w in words:
...     print w, hash(w), hash(w) % 11
...
zero -1293135726 8
one -261223665 1
two 323309869 3
three 1505609005 10
four -1250885616 9
five 202874452 0
###

(Ignore the magic constant 11 for a moment: we'll revisit it in a moment.)

Notice that the third column contains numbers that are all distinct.
This is neat because, suddenly, we can do something like this:

###
>>> for index, word in enumerate(words):
...     hashed_values[hash(word) % 11] = index
...
>>> hashed_values
[5, 1, None, 2, None, None, None, None, 0, 4, 3]
>>>
>>>
>>> hashed_values[hash('one') % 11]
1
>>> hashed_values[hash('three') % 11]
3
>>> hashed_values[hash('zero') % 11]
0
###

And now we have a fast scheme where we can go back from our arbitrary
string key back to our value.  No linear scan necessary: just a little bit
of arithmetic.

And that's what dictionaries do underneath the surface:  they maintain a
intermediate list --- a "hash table" --- where the elements are scattered
around in some wacky order based on the key hash value.  Since keys will
hash() to a predictable number, when we want to look up a key/value pair
later on, we get to later use hash() again, and take advantage of the
speed of numeric indxed lookup on that intermediate list.

Going back to that magic number: '11' here was cheating.  *grin* It's a
magic constant number I just pulled out of thin air, and it was chosen for
this example just to make all the values distinct.  If we chose a smaller
constant, we would have gotten not-so-nice results:

###
>>> for w in words:
...     print w, hash(w), hash(w) % 3, hash(w) % 5, hash(w) % 7, hash(w) % 11
...
zero -1293135726 0 4 6 8
one -261223665 0 0 4 1
two 323309869 1 4 1 3
three 1505609005 1 0 5 10
four -1250885616 0 4 6 9
five 202874452 1 2 4 0
###

Eleven seemed to be the right number here.

But real hashing usually doesn't have the benefit of hindsight, so
hashtables will often have to deal with the possible situation where two
keys will "collide" together with the same hash value.

A lot of computer science algorithm books will talk about different ways
of resolving hash collisions, but we probably don't need to talk about
them here right now.  These collision resolution systesm have names like
"linear chaining" or "linear probing", both of which sound slightly
painful.  *grin* They're actually not that bad, but this introduction is
already too long.

Does the idea of hashing make more sense now?  This should make it more
clear why the keys in a hash table don't appear to be ordered: they're
deliberately scrambled by the hashing function.

Hope this helps!