[Tutor] Simple counter to determine frequencies of words in a document
ALAN GAULD
alan.gauld at btinternet.com
Sat Nov 20 23:37:24 CET 2010
If the file is big use Peter's method, but 45 minutes still seems
very long, so there may be a hidden bug in there somewhere.
However...
> When I look at the current processes running on my computer, I see the
> Python process taking 100% of the CPU. Since my computer has a
> multi-core processor, I'm assuming this process is using only one of
> the cores because another monitor tells me that the CPU usage is under
> 20%. This doesn't make much sense to me.
It's perfectly normal. The computer assigns Python to one core and uses
the other cores to run other tasks. That's why it's called multi-tasking.
There are tricks to spread the Python load over multiple cores but that
is rarely necessary, and I don't think we need it here.
> any in this case. See, I'm not only a newbie in Python but a newbie
> with IDEs as well. I'm using Eclipse (probably I should have started
> with something smaller and simpler) and I see the following error
> message:
Don't run your code inside the IDE except for testing. IDEs are
Development Environments, they are not ideal for executing production
code. Run your file from the Terminal command prompt directly.
> def countWords2(wordlist): #as proposed by Peter Otten
>     word_table = {}
>     for word in wordlist:
>         if word in word_table:
>             word_table[word] += 1
>         else:
>             word_table[word] = 1
OK to here...
>         count = wordlist.count(word)
>         word_table[word] = count
But you don't need these lines. They call count() for every word,
which makes Python rescan the whole list each time round the loop. You
are already counting the occurrences as you go with the += 1 line.
In fact the assignment to word_table here overwrites the
incremental counter and negates the value of the optimisation!
>     return sorted(
>         word_table.items(), key=lambda item: item[1], reverse=True
>     )
> words = getWords('tokens_short.txt')
> table = countWords(words) # or table = countWords2(words)
> writeTable('output.txt', table)
It would be worth putting some print statements between these functions
just to monitor progress. Something like
    print "reading file..."
    print "counting words..."
    print "writing file..."
That way you can see which function is running slowly, although
it is almost certainly the counting. But as a general debugging tip
it's worth remembering. A few (and I mean a few, don't go mad!)
print statements can narrow things down very quickly.
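As a rough sketch of the idea, here is one way to print progress and
timings around each step. It uses Python 3 print() calls and the time
module. The get_words and count_words functions and the timed helper
are stand-ins I made up so the example runs on its own; substitute
your own getWords, countWords and writeTable.

```python
import time

# Hypothetical stand-ins so the sketch runs by itself; use your real functions.
def get_words():
    return "the cat and the dog and the bird".split()

def count_words(words):
    table = {}
    for word in words:
        table[word] = table.get(word, 0) + 1
    return table

def timed(label, func, *args):
    """Print a progress message, run func, and report how long it took."""
    print(label + "...")
    start = time.perf_counter()
    result = func(*args)
    elapsed = time.perf_counter() - start
    print("  done in %.3f seconds" % elapsed)
    return result

words = timed("reading file", get_words)
table = timed("counting words", count_words, words)
```

Whichever step reports the largest time is the one to look at first.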
> every time you encounter the same word in the loop. This is more or
> less what Peter said of the solution proposed by Alan, right?
Correct, but you have replicated that in Peter's optimised version.
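For what it's worth, here is roughly how the function looks with the
two count lines taken out. Treat it as a sketch rather than the final
word; the sample list at the bottom is made up purely for illustration.

```python
def countWords2(wordlist):
    """Count each word in a single pass, then sort by frequency."""
    word_table = {}
    for word in wordlist:
        if word in word_table:
            word_table[word] += 1
        else:
            word_table[word] = 1
    # No wordlist.count(word) here: the dictionary already holds the totals.
    return sorted(word_table.items(), key=lambda item: item[1], reverse=True)

# Made-up sample data, just to show the shape of the result.
sample = "a b a c b a".split()
print(countWords2(sample))  # prints [('a', 3), ('b', 2), ('c', 1)]
```

Each word is now touched once, so the running time grows with the
length of the list rather than with its square. If your Python version
has it, collections.Counter in the standard library does much the same
job.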
HTH,
Alan G.