[Tutor] minor display issue with python dictionaries

Joel Goldstick joel.goldstick at gmail.com
Mon Nov 25 02:23:26 CET 2013


On Sun, Nov 24, 2013 at 8:22 PM, Joel Goldstick
<joel.goldstick at gmail.com> wrote:
> On Sun, Nov 24, 2013 at 8:12 PM, Steven D'Aprano <steve at pearwood.info> wrote:
>> On Sun, Nov 24, 2013 at 09:33:11PM +0530, Reuben wrote:
>> [...]
>>> From the above output, I see key 'c' is at third position during input, but
>>> while displaying the output it is displayed at second position
>>
>> Dictionaries are deliberately made unordered. That means the order that
>> items will be displayed is free to vary, not only from how you entered
>> them, but in principle they might even vary each time you inspect the
>> dict. (They probably won't, but they could.)
>>
>> Python dicts are actually "hash tables", which is a standard computer
>> science data structure that you can look up, but in a nutshell, a hash
>> table is an array indexed by the key's hash value. Why do we care about
>> hash tables? Because they make a really fast and efficient way to look
>> up keys, and from the key get access to a value.
>>
>> Let's start with the simplest way to store a key and value pair, an
>> unsorted table of pairs with a marker indicating "blank", so you know
>> when you've reached the end. Here's an example of a half-filled table:
>>
>> { ("Hello", value),  # key first, then some value
>>   ("World", value),
>>   ("Spam", value),
>>   ("Eggs", value),
>>   ("Cheese", value),
>>   <blank>,
>>   <blank>,
>>   <blank>,
>>   <blank>,
>>   <blank>,
>> }
>>
>> In order to find a key in the table, you have to inspect every position
>> until you reach either the key you are looking for, or <blank>, in which
>> case you know that the key is not there. So to find "Eggs", you have to
>> inspect "Hello", "World", "Spam" and finally "Eggs". On average, a
>> successful search takes half as many inspections as there are keys, e.g.
>> if you have 10 keys in the table, you need to inspect 5 positions; if
>> there are 100 keys, you need to inspect 50 positions on average. If
>> there are 1000 keys, you need to inspect 500 positions.
>>
>> This is pretty slow. Imagine looking up a word in a dictionary (a real
>> paper dictionary), but with a twist: the words are in whatever order the
>> dictionary writer first thought of them, not alphabetical order, so for
>> any word you have to start at the beginning and read every single world
>> until you find it.
>>
>> Obviously we can do better, by keeping the words sorted. Our table will
>> look like this:
>>
>> { ("Cheese", value),
>>   ("Eggs", value),
>>   ("Hello", value),
>>   ("Spam", value),
>>   ("World", value),
>>   <blank>,
>>   <blank>,
>>   <blank>,
>>   <blank>,
>>   <blank>,
>> }
>>
>> In this case, we can use a technique called "binary search" to narrow
>> down the possible places the key might be. This is very much like what
>> you probably do when looking up words in a paper dictionary: flip the
>> dictionary open to the middle, then decide whether you've gone too far
>> or not far enough. Each time, you split the remaining half in half
>> again, until you've narrowed down to the page containing the word you
>> want.
>>
>> In a sorted table with binary search, the savings over an unsorted table
>> can be very high. With 10 keys, on average you will end up inspecting 3
>> or 4 items, which isn't much improvement over 5 for the unsorted case,
>> but with 100 keys, you will inspect 6 or 7, and with 1000 keys, 9 or 10.
>> With a million keys, you only need to inspect about 20 items to either
>> find the key you are looking for, or determine it isn't there.
>>
>> Can we do better than binary search? Yes we can, and we do so with a
>> hash table, which is what Python dicts are.
>>
>> The secret to the hash table is that instead of putting all the keys at
>> the beginning of the table, we want the entries to be scattered at
>> random all over the place. Only not quite at random, since we need a way
>> to jump straight to the key when asked. That's the *hash function*,
>> which converts a key to an arbitrary, and distinct, index:
>>
>> { <blank>,
>>   ("World", value),   # hash of "World" is 1
>>   ("Cheese", value),  # hash of "Cheese" is 2
>>   <blank>,
>>   <blank>,
>>   ("Eggs", value),    # hash of "Eggs" is 5
>>   <blank>,
>>   ("Spam", value),    # hash of "Spam" is 7
>>   <blank>,
>>   ("Hello", value),   # hash of "Hello" is 9
>> }
>>
>> So the positions of the keys in the table depend on the keys, not the
>> order you write them down. The advantage of this is that looking for a
>> key requires (near enough to) constant time. If you want to see whether
>> "Eggs" is in the table, it doesn't matter whether there are ten keys or
>> ten billion, it will take exactly the same amount of time: hash the key
>> "Eggs" to give you index 5, look at index 5, it is either there or it
>> isn't.
>>
>> Now, in reality we can't actually guarantee that every hash value is
>> distinct. In real life, a hash table has to deal with "collisions",
>> which is when two different keys have the same hash value, and that will
>> spoil the nice simple constant time performance. But collisions are
>> usually rare, and so when discussing hash tables in broad terms we
>> normally ignore them.
>>
>> In Python, you can calculate the hash value of a key with the hash()
>> function. The exact values you get may depend on the specific version of
>> Python you are using, but to give you an example:
>>
>> py> hash("cat")
>> 405875055
>> py> hash("hat")
>> 1255409212
>> py> hash("fat")
>> -750391218
>>
>>
>> So you can see, by changing just one letter of the key, we get a
>> completely different hash value. A good hash function -- and Python's
>> hash function is good -- will will mix up the positions of keys in a
>> very unpredictable way.
>>
>> The end result of this is that when displaying a dict, keys are shown in
>> whatever order they happen to be inside the hash table. That's
>> unpredicable, depending on the details of the hash function, the actual
>> keys, and whether there have been any collisions or not.
>>
>> If you care about this, it is easy enough to sort the keys before
>> printing them:
>>
>> # instead of this
>> print mydict.keys()
>>
>> # do this
>> print sorted(mydict.keys())
>>
>>
>> If you care about keeping the insertion order, rather than alphabetical
>> order, you can use an OrderedDict:
>>
>> from collections import OrderedDict
>>
>>
>> but OrderedDicts are slower and use more memory than regular dicts,
>> since they have to track the order that the keys were inserted.
>>
>>
>>
>> --
>> Steven
>> _______________________________________________
>> Tutor maillist  -  Tutor at python.org
>> To unsubscribe or change subscription options:
>> https://mail.python.org/mailman/listinfo/tutor
>
> Well explained Stephen

Oops... I mean Steven!...
>
> --
> Joel Goldstick
> http://joelgoldstick.com



-- 
Joel Goldstick
http://joelgoldstick.com


More information about the Tutor mailing list