[Tutor] Extracting words(quest 2)

Remco Gerlich scarblac@pino.selwerd.nl
Wed, 27 Mar 2002 01:42:44 +0100


On  0, Alexandre Ratti <alex@gabuzomeu.net> wrote:
> Hello,
> 
> 
> At 15:25 26/03/2002 -0500, tutor-request@python.org wrote:
> >From: dman <dman@dman.ddts.net>
> >To: tutor@python.org
> >Subject: Re: [Tutor] Extracting words(quest 2)
> >
> >On Tue, Mar 26, 2002 at 05:06:51PM +0100, Alexandre Ratti wrote:
> >...
> >| (in dictionaries, entry order is randomized to speed up access).
> >
> >This doesn't exactly say what you meant.  The entry order isn't
> >random; it follows a precise algorithm, which is what makes it fast.
> > >From our perspective, when we print a dict, it does appear to be a
> >random order.
> 
> Thanks dman; somehow I thought the entry order really was random. Do you 
> know how this algorithm works? (eg. how are the entries organised?)

Dictionaries are implemented by means of a data structure called 
"hash table". It's a bit late at night here, so I can't spend much time on
finding a good reference... the third Google hit is
http://ciips.ee.uwa.edu.au/~morris/Year2/PLDS210/hash_tables.html .

That page looks like it's totally correct and therefore impossible to
understand, at first sight.

Basically, a dictionary of n items is an array of m items, were m is rather
larger than n. The keys are then put through a 'hashing function' - usually
the memory address is put through some complicated function so that it
yields a number between 0 and m-1; it's randomish so that different keys are
unlikely to give the same number. Then the key,value pair is stored at that
place in the array.

When you do a lookup in a dictionary, all that needs to be done is compute
the hash value of the key, and you know where the value is. There might be
more than one value there, and the algorithm handles that, but that is
unlikely by means of the construction of the hashing function, and because m
is large enough.

When you see the .keys() of a dictionary, it gives them in the order of
their hash values - which *is* related to the key values by means of the
hashing function, but which *looks* random because the hashing function is
designed to scatter them about as much as possible.

This little summary is proof that even uncomprehensible 2-year CS
explanations are preferable to the drunken ad hoc explanations of a
many-year CS student, no doubt...

Note that computing the hash function doesn't really depend on the number of
items in the dictionary - which is why they're fast, which is why they're so
cool.

-- 
Remco Gerlich