[Tutor] minor display issue with python dictionaries
Joel Goldstick
joel.goldstick at gmail.com
Mon Nov 25 02:22:41 CET 2013
On Sun, Nov 24, 2013 at 8:12 PM, Steven D'Aprano <steve at pearwood.info> wrote:
> On Sun, Nov 24, 2013 at 09:33:11PM +0530, Reuben wrote:
> [...]
>> From the above output, I see key 'c' is at third position during input, but
>> while displaying the output it is displayed at second position
>
> Dictionaries are deliberately made unordered. That means the order that
> items will be displayed is free to vary, not only from how you entered
> them, but in principle they might even vary each time you inspect the
> dict. (They probably won't, but they could.)
>
> Python dicts are actually "hash tables", which is a standard computer
> science data structure that you can look up, but in a nutshell, a hash
> table is an array indexed by the key's hash value. Why do we care about
> hash tables? Because they make a really fast and efficient way to look
> up keys, and from the key get access to a value.
>
> Let's start with the simplest way to store a key and value pair, an
> unsorted table of pairs with a marker indicating "blank", so you know
> when you've reached the end. Here's an example of a half-filled table:
>
> { ("Hello", value), # key first, then some value
> ("World", value),
> ("Spam", value),
> ("Eggs", value),
> ("Cheese", value),
> <blank>,
> <blank>,
> <blank>,
> <blank>,
> <blank>,
> }
>
> In order to find a key in the table, you have to inspect every position
> until you reach either the key you are looking for, or <blank>, in which
> case you know that the key is not there. So to find "Eggs", you have to
> inspect "Hello", "World", "Spam" and finally "Eggs". On average, a
> successful search takes half as many inspections as there are keys, e.g.
> if you have 10 keys in the table, you need to inspect 5 positions; if
> there are 100 keys, you need to inspect 50 positions on average. If
> there are 1000 keys, you need to inspect 500 positions.
>
> This is pretty slow. Imagine looking up a word in a dictionary (a real
> paper dictionary), but with a twist: the words are in whatever order the
> dictionary writer first thought of them, not alphabetical order, so for
> any word you have to start at the beginning and read every single world
> until you find it.
>
> Obviously we can do better, by keeping the words sorted. Our table will
> look like this:
>
> { ("Cheese", value),
> ("Eggs", value),
> ("Hello", value),
> ("Spam", value),
> ("World", value),
> <blank>,
> <blank>,
> <blank>,
> <blank>,
> <blank>,
> }
>
> In this case, we can use a technique called "binary search" to narrow
> down the possible places the key might be. This is very much like what
> you probably do when looking up words in a paper dictionary: flip the
> dictionary open to the middle, then decide whether you've gone too far
> or not far enough. Each time, you split the remaining half in half
> again, until you've narrowed down to the page containing the word you
> want.
>
> In a sorted table with binary search, the savings over an unsorted table
> can be very high. With 10 keys, on average you will end up inspecting 3
> or 4 items, which isn't much improvement over 5 for the unsorted case,
> but with 100 keys, you will inspect 6 or 7, and with 1000 keys, 9 or 10.
> With a million keys, you only need to inspect about 20 items to either
> find the key you are looking for, or determine it isn't there.
>
> Can we do better than binary search? Yes we can, and we do so with a
> hash table, which is what Python dicts are.
>
> The secret to the hash table is that instead of putting all the keys at
> the beginning of the table, we want the entries to be scattered at
> random all over the place. Only not quite at random, since we need a way
> to jump straight to the key when asked. That's the *hash function*,
> which converts a key to an arbitrary, and distinct, index:
>
> { <blank>,
> ("World", value), # hash of "World" is 1
> ("Cheese", value), # hash of "Cheese" is 2
> <blank>,
> <blank>,
> ("Eggs", value), # hash of "Eggs" is 5
> <blank>,
> ("Spam", value), # hash of "Spam" is 7
> <blank>,
> ("Hello", value), # hash of "Hello" is 9
> }
>
> So the positions of the keys in the table depend on the keys, not the
> order you write them down. The advantage of this is that looking for a
> key requires (near enough to) constant time. If you want to see whether
> "Eggs" is in the table, it doesn't matter whether there are ten keys or
> ten billion, it will take exactly the same amount of time: hash the key
> "Eggs" to give you index 5, look at index 5, it is either there or it
> isn't.
>
> Now, in reality we can't actually guarantee that every hash value is
> distinct. In real life, a hash table has to deal with "collisions",
> which is when two different keys have the same hash value, and that will
> spoil the nice simple constant time performance. But collisions are
> usually rare, and so when discussing hash tables in broad terms we
> normally ignore them.
>
> In Python, you can calculate the hash value of a key with the hash()
> function. The exact values you get may depend on the specific version of
> Python you are using, but to give you an example:
>
> py> hash("cat")
> 405875055
> py> hash("hat")
> 1255409212
> py> hash("fat")
> -750391218
>
>
> So you can see, by changing just one letter of the key, we get a
> completely different hash value. A good hash function -- and Python's
> hash function is good -- will will mix up the positions of keys in a
> very unpredictable way.
>
> The end result of this is that when displaying a dict, keys are shown in
> whatever order they happen to be inside the hash table. That's
> unpredicable, depending on the details of the hash function, the actual
> keys, and whether there have been any collisions or not.
>
> If you care about this, it is easy enough to sort the keys before
> printing them:
>
> # instead of this
> print mydict.keys()
>
> # do this
> print sorted(mydict.keys())
>
>
> If you care about keeping the insertion order, rather than alphabetical
> order, you can use an OrderedDict:
>
> from collections import OrderedDict
>
>
> but OrderedDicts are slower and use more memory than regular dicts,
> since they have to track the order that the keys were inserted.
>
>
>
> --
> Steven
> _______________________________________________
> Tutor maillist - Tutor at python.org
> To unsubscribe or change subscription options:
> https://mail.python.org/mailman/listinfo/tutor
Well explained Stephen
--
Joel Goldstick
http://joelgoldstick.com
More information about the Tutor
mailing list