[Tutor] minor display issue with python dictionaries

Steven D'Aprano steve at pearwood.info
Mon Nov 25 02:12:24 CET 2013


On Sun, Nov 24, 2013 at 09:33:11PM +0530, Reuben wrote:
[...]
> From the above output, I see key 'c' is at third position during input, but
> while displaying the output it is displayed at second position

Dictionaries are deliberately made unordered. That means the order that 
items will be displayed is free to vary, not only from how you entered 
them, but in principle they might even vary each time you inspect the 
dict. (They probably won't, but they could.)

Python dicts are actually "hash tables", which is a standard computer 
science data structure that you can look up, but in a nutshell, a hash 
table is an array indexed by the key's hash value. Why do we care about 
hash tables? Because they make a really fast and efficient way to look 
up keys, and from the key get access to a value.

Let's start with the simplest way to store a key and value pair, an 
unsorted table of pairs with a marker indicating "blank", so you know 
when you've reached the end. Here's an example of a half-filled table:

{ ("Hello", value),  # key first, then some value
  ("World", value), 
  ("Spam", value),
  ("Eggs", value),
  ("Cheese", value),
  <blank>,
  <blank>, 
  <blank>, 
  <blank>, 
  <blank>,
}

In order to find a key in the table, you have to inspect every position 
until you reach either the key you are looking for, or <blank>, in which 
case you know that the key is not there. So to find "Eggs", you have to 
inspect "Hello", "World", "Spam" and finally "Eggs". On average, a 
successful search takes half as many inspections as there are keys, e.g. 
if you have 10 keys in the table, you need to inspect 5 positions; if 
there are 100 keys, you need to inspect 50 positions on average. If 
there are 1000 keys, you need to inspect 500 positions.

This is pretty slow. Imagine looking up a word in a dictionary (a real 
paper dictionary), but with a twist: the words are in whatever order the 
dictionary writer first thought of them, not alphabetical order, so for 
any word you have to start at the beginning and read every single world 
until you find it.

Obviously we can do better, by keeping the words sorted. Our table will 
look like this:

{ ("Cheese", value),
  ("Eggs", value),
  ("Hello", value),
  ("Spam", value),
  ("World", value), 
  <blank>,
  <blank>, 
  <blank>, 
  <blank>, 
  <blank>,
}

In this case, we can use a technique called "binary search" to narrow 
down the possible places the key might be. This is very much like what 
you probably do when looking up words in a paper dictionary: flip the 
dictionary open to the middle, then decide whether you've gone too far 
or not far enough. Each time, you split the remaining half in half 
again, until you've narrowed down to the page containing the word you 
want.

In a sorted table with binary search, the savings over an unsorted table 
can be very high. With 10 keys, on average you will end up inspecting 3 
or 4 items, which isn't much improvement over 5 for the unsorted case, 
but with 100 keys, you will inspect 6 or 7, and with 1000 keys, 9 or 10. 
With a million keys, you only need to inspect about 20 items to either 
find the key you are looking for, or determine it isn't there.

Can we do better than binary search? Yes we can, and we do so with a 
hash table, which is what Python dicts are.

The secret to the hash table is that instead of putting all the keys at 
the beginning of the table, we want the entries to be scattered at 
random all over the place. Only not quite at random, since we need a way 
to jump straight to the key when asked. That's the *hash function*, 
which converts a key to an arbitrary, and distinct, index:

{ <blank>,
  ("World", value),   # hash of "World" is 1
  ("Cheese", value),  # hash of "Cheese" is 2
  <blank>,
  <blank>, 
  ("Eggs", value),    # hash of "Eggs" is 5
  <blank>, 
  ("Spam", value),    # hash of "Spam" is 7
  <blank>,
  ("Hello", value),   # hash of "Hello" is 9
}

So the positions of the keys in the table depend on the keys, not the 
order you write them down. The advantage of this is that looking for a 
key requires (near enough to) constant time. If you want to see whether 
"Eggs" is in the table, it doesn't matter whether there are ten keys or 
ten billion, it will take exactly the same amount of time: hash the key 
"Eggs" to give you index 5, look at index 5, it is either there or it 
isn't.

Now, in reality we can't actually guarantee that every hash value is 
distinct. In real life, a hash table has to deal with "collisions", 
which is when two different keys have the same hash value, and that will 
spoil the nice simple constant time performance. But collisions are 
usually rare, and so when discussing hash tables in broad terms we 
normally ignore them.

In Python, you can calculate the hash value of a key with the hash() 
function. The exact values you get may depend on the specific version of 
Python you are using, but to give you an example:

py> hash("cat")
405875055
py> hash("hat")
1255409212
py> hash("fat")
-750391218


So you can see, by changing just one letter of the key, we get a 
completely different hash value. A good hash function -- and Python's 
hash function is good -- will will mix up the positions of keys in a 
very unpredictable way.

The end result of this is that when displaying a dict, keys are shown in 
whatever order they happen to be inside the hash table. That's 
unpredicable, depending on the details of the hash function, the actual 
keys, and whether there have been any collisions or not.

If you care about this, it is easy enough to sort the keys before 
printing them:

# instead of this
print mydict.keys()

# do this
print sorted(mydict.keys())


If you care about keeping the insertion order, rather than alphabetical 
order, you can use an OrderedDict:

from collections import OrderedDict


but OrderedDicts are slower and use more memory than regular dicts, 
since they have to track the order that the keys were inserted.



-- 
Steven


More information about the Tutor mailing list