[Tutor] Simple counter to determine frequencies of words in a document

Sun Nov 21 00:01:04 CET 2010

Good evening,

 : It turns out that matters of efficiency appear to be VERY 
 : important in this case. The example in my message was a very 
 : short string but the file that I'm trying to process is pretty 
 : big (20MB of text).

Efficiency is best addressed first and foremost, not by hardware, 
but by choosing the correct data structure and algorithm for 
processing the data.  You have more than enough hardware to deal 
with this problem, and appear to be wrestling still with why this 
apparently simple problem is

An earlier responder (Peter Otten) pointed out to you that 
efficiency was one issue:

   you repeat counting the number of occurrences of a word that 
   appears N times N times

And another (Steven D'Aprano) pointed out that your countWords will 
be inefficient, but probably tolerable for data sets under about 
10000 words.

 : frequencies and it has been running for over half an hour without 
 : having generated the output file yet. I'm using a pretty powerful 
 : computer (core i7 with 8GB of RAM) so I'm a little surprised (and 
 : a bit worried as well) that the process hasn't finished yet. I 
 : tested the script before with a much smaller file and the output 
 : was as desired.

[your comment, snipped in here, out of order]

 : However, even with countWords2, which is supposed to overcome this
 : problem, it feels as if I've entered an infinite loop.

You have a 20MB file and 8GB of RAM, and it has taken half an hour?  
You have entered an infinite loop (or some other horrible thing).  
First, I would say that you don't have to worry too much about 
efficiency, for your first draft, just work on correctness of 
result.  My machine is not as snappy as yours and finishes the job 
in the sloppiest way in well under 1 second.

However, once you have solved the problem and feel good about the 
correctness, then learning how to process files/data efficiently 
would be a step in the direction of processing the same problem on a 
20GB or 20TB file.

 : When I look at the current processes running on my computer, I 
 : see the Python process taking 100% of the CPU. Since my computer 
 : has a multi-core processor, I'm assuming this process is using 
 : only one of the cores because another monitor tells me that the 
 : CPU usage is under 20%. 

Correct.

 : This doesn't make much sense to me. I bought a computer with a 
 : powerful CPU precisely to do these kinds of things as fast as 
 : possible. How can it be that Python is only using such a small 
 : amount of processing power?

<digression>

This is far afield from the question of word count, but may be 
useful someday.

The beauty of a multiple processors is that you can run independent 
processes simultaneously (I'm not talking about multitasking).  
Using most languages(*), a single process will only use one of your 
available processors.  Obviously, this was once a significant 
limitation, and so we came up with the idea of threads as a way to 
take advantage of multiple processors inside a single process.  
Python supports threads.  An application must be written to take 
advantage of threading.  I do not think I would recommend that for 
you until you have eked out as much performance as you can using a 
process based model.  Here are the very first links I can find which 
delve into the topic, though there's much more there:

  http://docs.python.org/library/threading.html
  http://www.devshed.com/c/a/Python/Basic-Threading-in-Python/
  http://www.dabeaz.com/python/GIL.pdf

</digression>

OK, on to your code.

 : def countWords(wordlist):
 :     word_table = {}
 :     for word in wordlist:
 :         count = wordlist.count(word)
 :         print "word_table[%s] = %s" % (word,word_table.get(word,'<none>'))
 :         word_table[word] = count

Problem 1:  You aren't returning anything from this function.
  Add:
       return word_table

Problem 2: You are doing more work than you need.  Peter Otten 
  pointed out how.  To see what he was observing, try this riff on 
  your function:

    def countWords(wordlist):
        word_table = dict()
        for word in wordlist:
            count = wordlist.count(word)
            print "word_table[%s] = %s\n" % (word,word_table.get(word,'<none>'))
            word_table[word] = count
        return word_table

What you should see is evidence that the second and third times that 
you iterate over the word 'the', you are updating an entry in the 
'word_table' dictionary that already exists with the correct value.

 : def countWords2(wordlist): #as proposed by Peter Otten
 :     word_table = {}
 :     for word in wordlist:
 :         if word in word_table:
 :             word_table[word] += 1
 :         else:
 :             word_table[word] = 1
 :         count = wordlist.count(word)
 :         word_table[word] = count
 :     return sorted(
 :                   word_table.items(), key=lambda item: item[1], reverse=True
 :                   )

In the above, countWords2, why not omit these lines:

 :         count = wordlist.count(word)
 :         word_table[word] = count

And, try the function again.

Let try a (bit of a labored) analogy of your problem.  To 
approximate your algorithm.

  I have a clear tube with gumballs of a variety of colors.
  I open up one end of the tube, and mark where I'm starting.

  I pull out a gumball, and discover it's red.
  I mark down on my chalkboard that I have 1 red gumball.
  I count all of the red gumballs I can see in the clear tube.
  I mark down that I have 5 red gumballs.
  I put that gumball back in the other end of the tube.

  I take the second gumball from the tube and discover it's green.
  I mark down that I have 1 green gumball.
  I count all of the green gumballs I can see in the clear tube.
  I mark down that I have 1 green gumball.
  I put that gumball back in the other end of the tube.

  I pull out a gumball, and discover it's red.
  I mark down on my chalkboard that I have 1 red gumball.
  I count all of the red gumballs I can see in the clear tube.
  I mark down that I have 5 red gumballs.
  I put that gumball back in the other end of the tube.

  ...

Do you see the duplicated effort?  I would also suggest re-reading 
both the mail from Peter Otten and Steven D'Aprano.  You appear to 
be much closer than earlier, but both of them are pointing out 
something that you don't appear to quite have grasped yet.

As a side note, you may find it useful to simply play with some 
lists and dictionaries in the python interpreter:

  >>> words = 'in the garden on the bank behind the tree'.split()
  >>> words
  ['in', 'the', 'garden', 'on', 'the', 'bank', 'behind', 'the', 'tree']
  >>> word_table = dict()
  >>> w = words[0]
  >>> word_table[w] = words.count(w)
  >>> word_table
  {'in': 1}
  >>> w = words[1]
  >>> word_table[w] = words.count(w)
  >>> word_table
  {'the': 3, 'in': 1}

Once you gain familiarity with the lists and dicts, you can try out 
collections, as suggested by Peter Otten.

Enjoy and good luck!

-Martin

 * Somebody will be certain to point out a language or languages 
   that provide some sort of facility to abstract the use of 
   multiple processors without the explicit use of threads.

-- 
Martin A. Brown
http://linux-ip.net/