Issue in printing top 20 dictionary items by dictionary value

Alexander Blinne news at blinne.net
Sat Oct 4 12:28:43 CEST 2014


Am 04.10.2014 um 11:11 schrieb Shiva:
> Hi All,
> 
> I have written a function that 
> -reads a file 
> -splits the words and stores it in a dictionary as word(key) and the total
> count of word in file (value).
> 
> I want to print the words with top 20 occurrences in the file in reverse
> order - but can't figure it out.

As python is a high-level language with a comprehensive standard
library, there ought to be a built-in method to sort stuff... Wait, but
there ist:

sorted(iterable[, key][, reverse])

The tricky part is: a dictionary is not storing the order of its
entries. But you could just go with the list of your keys, sorted by the
count of your words and then go from there.

> Here is my function:
> 
> def print_top(filename):
> 
>     #Open a file
>     path = '/home/BCA/Documents/LearnPython/ic/'
>     fname = path + filename
>     print ('filename: ',fname)
>     filetext = open(fname)
> 
>     #Read the file
>     textstorage={}
> 
>     #print(type(textstorage))
>     readall = filetext.read().lower()
>     eachword = set(readall.split())
> 
>     #store split words as keys in dictionary
>     for w in eachword:
>         textstorage[w] = readall.count(w)
> 
>     #print top 20 items in dictionary by decending order of val
>     # This bit is what I can't figure out.

orderedwords = sorted(textstorage.keys(), key=textstorage.get, reverse=True)
toptwenty = orderedwords[:20]

for dkey in toptwenty:
    print(dkey,textstorage[dkey])

The interesting part is the "key=textstorage.get"-portion. This defines
according to what "key" two values of the iterable (the keys of the
dict) should be compared. The method textstorage.get will accept a word
and return the count of that word. This count is used to sort the words
- in reverse order.


Another remark: Your algorithm works, but imagine this: Your text could
be more than a few lines long and contain more than a few different
words. When you convert your text into the set of alle the words you
have to walk through it once. You only type one line into python, but
this is what python has to do at that point. Afterwards, for every
single word, you use the count-method of the complete text. This also
walks through the whole text - each time. So if your text has N
different words you walk through it N+2 times (your readall line also
walks completely through it!). It is possible to solve your problem
while only walking through the text only a few (independent of N) times
or even only a single time! And the larger your text gets, the more
important this gets. This is not an issue in python but in all of
computer science.

So try and rethink your algorithm with that goal in mind: Only walk
through the text once. Python makes strong use on iterators, also file
objects can be used as iterators. The count in your dictionary can be
updated while you walk through the text. The get-method has a keyword
argument for creating default values, which might be useful here.
Another thing worth of mentioning is, that python has exactly this kind
of machinery already built-in (collections.Counter), but you should try
and implement a simple version of it yourself as exercise.

Alexander



More information about the Python-list mailing list