[Tutor] Re: [Tutor]: Totally laughable basic questions [introduction to hashing]

Danny Yoo dyoo@hkn.eecs.berkeley.edu
Tue, 15 Jan 2002 14:04:36 -0800 (PST)


On Tue, 15 Jan 2002, Pijus Virketis wrote:

> > A LIST is rather like a 1 dimensional array. You can change the
> > contents of a list, even append new cells.
> >
> > A DICTIONARY is rather like 2 lists in one, where there is a KEY and a
> > CONTENT.
> 
> One very important difference between a list and a dictionary is that
> a dictionary is not ordered! In other words, it stores the content,
> but not the structure of the content. Because of that, a dictionary is
> a "faster" data structure, but a list is sometimes more useful.

[warning: this is a long message, and tries to explain some of the ideas
behind dictionaries.  Get a cup of coffee first.  *grin*]


Lists can be "fast" too.  Let's say we had a list like this:

###
>>> amino_acids = [ ('ala', 'alanine'),
...                 ('cys', 'cysteine'),
...                 ('asp', 'aspartic acid'),
...                 ('glu', 'glutamic acid'),
...                 ('phe', 'phenylalanine') ]
###

It's very easy, with this list structure, to get the 5th element of the
list:

###
>>> amino_acids[4]
('phe', 'phenylalanine')
###


If we know the position of an element of a list, grabbing at that element
is very fast.  This is important to realize: as long as we have a position
into our list, we can grab at that list element instantly.


It's a whole other question, though, if we don't already know the
position.  Let's say that we're looking for the entry associated with
'glu':

###
>>> def findEntry(query_name, name_value_pairs):
...     for name, value in name_value_pairs:
...         if name == query_name: 
...             return value
...     return None
... 
>>> findEntry('glu', amino_acids)
'glutamic acid'
###

As you might guess, this is not so fast.  We're basically scanning through
each element in our list to search for the entry.  The big problem here is
that we don't know the numerical position of the element, but we still
want to look it up.  This is a very important problem!  If we have a
dictionary of words and definitions, we want to make looking up a
definition really fast.


If only there were a way to somehow something like "glu" into a position
in our list, then we might be able to quickly look up things, if we
arrange our list properly.  And there is a way!

###
>>> hash('glu')
127508733
###

hash() is a function that takes any Python object, and calculates some
random looking number.  It's not random though; it really does depend on
the content of the object:

###
>>> l = [(hash(x), x) for x in ['ala', 'cys', 'asp', 'glu', 'phe'] ]
>>> l.sort()
>>> l
[(-1586925553, 'ala'),
 (-1585925437, 'asp'),
 (127508733, 'glu'),
 (413875008, 'cys'),
 (631676366, 'phe')]
###


(Let's take the definition of hash() as a given for this message; we can
worry about how it works later.)  Now we can imagine a really long list:

###
BIGLIST = [('ala', 'alanine'), None, None, ....,
           ('asp', 'aspartic acid'), None, None, ....,
           ('glu', 'glutamic acid'), None, None, ....,
           ('cys', 'cysteine'), None, None, ....,
           ('phe', 'phenylalanine'),
          ]
###

I'm using four dots here to indicate the hugeness of this list.  If we
were insane enough to actually construct a list of this size, with entries
in the correct positions, then we could instantly look things up, with
something like:

###
def sillyLookupAminoAcid(key):
    return BIGLIST[hash(key) + 1586925553]
###

But again, we run into another problem: most of us are not insane enough
to actually build a list this long!  There are only 5 useful elements in
BIGLIST, so we're really wasting a heck of a lot of memory here.



One approach we can use to reduce the waste is to shrink down the range of
the numbers with the 'remainder' or 'modulo' function:

###
>>> [ (x % 11, y) for (x, y) in l]
[(7, 'ala'), (3, 'asp'), (0, 'glu'), (8, 'cys'), (2, 'phe')]
###

Wow, that's a big improvement!  Using the modulo function, we've just
forced all the numbers to be between 0 and 11, by taking the remainders.  
And we can build a list for this, no sweat.

###
hashed_acids = [('glu', 'glutamic acid'), 
		None,
		('phe', 'phenylalanine'),
		('asp', 'aspartic acid'),
		None,
		None,
		None,
		('ala', 'alanine'),
		('cys', 'cysteine')]

def lookupAminoAcid(key):
    return hashed_acids[ hash(key) % 11 ]
###


Does this actually work?

###
>>> lookupAminoAcid('ala')
('ala', 'alanine')
>>> lookupAminoAcid('asp')
('asp', 'aspartic acid')
###


Wow.  This actually worked.  *grin* And it's fast, which is what we're
looking for: a fast way of looking up definitions if we have a key.


But isn't this all black magic?  Wouldn't you rather not worry about all
this sort of stuff?  That's what dictionaries are for.

###
>>> acid_dict = { 'ala' : 'alanine',
...               'cys' : 'cysteine',
...               'asp' : 'aspartic acid',
...               'glu' : 'glutamic acid',
...               'phe' : 'phenylalanine' }
>>> acid_dict['asp']
'aspartic acid'
###

Dictionaries do pretty much everything that we did above, automatically,
without the mind-numbing boredom of listening to a Computer Science
lecture.

Lesson: use dictionaries --- they are very good.  *grin*


Hope this helps!