[Tutor] Simple counter to determine frequencies of words in a document

Sat Nov 20 23:37:24 CET 2010

If the file is big, use Peter's method, but 45 minutes still seems 
very long, so there may be a hidden bug in there somewhere.

However...

> When I look at the current processes running on my computer, I see the
> Python process taking 100% of the CPU. Since my computer has a
> multi-core processor, I'm assuming this process is using only one of
> the cores because another monitor tells me that the CPU usage is under
> 20%. This doesn't make much sense to me.

It's perfectly normal. The computer assigns Python to one core and uses 
the other cores to run other tasks. That's why it's called multi-tasking. 
There are tricks to spread the Python load over multiple cores, but that 
is rarely necessary, and I don't think we need it here.

> any in this case. See, I'm not only a newbie in Python but a newbie
> with IDEs as well. I'm using Eclipse (probably I should have started
> with something smaller and simpler) and I see the following error
> message:

Don't run your code inside the IDE except for testing. IDEs are 
Development Environments; they are not ideal for executing production 
code. Run your file directly from the terminal command prompt.

> def countWords2(wordlist): #as proposed by Peter Otten
>     word_table = {}
>     for word in wordlist:
>         if word in word_table:
>             word_table[word] += 1
>         else:
>             word_table[word] = 1

OK to here...

>         count = wordlist.count(word)
>         word_table[word] = count

But you don't need these lines. They call count() for every word, 
which makes Python rescan the whole word list each time. In this approach 
you are already counting the occurrences as you go with the += 1 line. 
And in fact the assignment to word_table here overwrites the 
incremental counter and negates the value of the optimisation!

>     return sorted(
>         word_table.items(), key=lambda item: item[1], reverse=True
>     )
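
For reference, here is a minimal sketch of what the function might look 
like with the two count() lines removed. The function name and the sort 
come from the quoted code; the sample sentence at the end is made up 
purely for illustration, and the print call is written with parentheses 
so it runs on both Python 2 and Python 3.

```python
def countWords2(wordlist):
    """Count each word in a single pass over the list."""
    word_table = {}
    for word in wordlist:
        if word in word_table:
            word_table[word] += 1   # seen before: bump the running count
        else:
            word_table[word] = 1    # first sighting of this word
    # Most frequent words first.
    return sorted(word_table.items(), key=lambda item: item[1], reverse=True)


# Made-up sample input, only to show the result format.
sample = "the cat sat on the mat the end".split()
print(countWords2(sample)[0])  # ('the', 3)
```

Each word is looked up in the dictionary once per occurrence, so the 
work grows with the length of the list rather than with its square.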

> words = getWords('tokens_short.txt')
> table = countWords(words) # or table = countWords2(words)
> writeTable('output.txt', table)

It would be worth putting some print statements between these function 
calls just to monitor progress. Something like:

print "reading file..."
print "counting words..."
print "writing file..."

That way you can see which function is running slowly, although 
it is almost certainly the counting. But as a general debugging tip 
it's worth remembering. A few (and I mean a few, don't go mad!) 
print statements can narrow things down very quickly.
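
If you want numbers as well as progress messages, one possible sketch is 
a small wrapper that prints a label, runs a function and reports how long 
it took. The helper name timed is hypothetical (it is not from this 
thread or from any library), and the calls below use stand-in data and 
built-in functions only to show the pattern.

```python
import time


def timed(label, func, *args):
    """Print a progress label, run func(*args), and report the elapsed time."""
    print(label)
    start = time.time()
    result = func(*args)
    print("  done in %.2f seconds" % (time.time() - start))
    return result


# Stand-in steps, used only to demonstrate the helper.
words = timed("splitting text...", "the cat sat on the mat".split)
total = timed("counting words...", len, words)
```

In the script above the same idea would read something like 
words = timed("reading file...", getWords, 'tokens_short.txt'), and 
likewise for the counting and writing steps.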

> every time you encounter the same word in the loop. This is more or
> less what Peter said of the solution proposed by Alan, right?

Correct, but you have replicated that in Peter's optimised version.

HTH,

Alan G.