[Tutor] Simple counter to determine frequencies of words in a document
Martin A. Brown
martin at linux-ip.net
Sun Nov 21 00:01:04 CET 2010
Good evening,
: It turns out that matters of efficiency appear to be VERY
: important in this case. The example in my message was a very
: short string but the file that I'm trying to process is pretty
: big (20MB of text).
Efficiency is best addressed first and foremost, not by hardware,
but by choosing the correct data structure and algorithm for
processing the data. You have more than enough hardware to deal
with this problem, and appear to be wrestling still with why this
apparently simple problem is
An earlier responder (Peter Otten) pointed out to you that
efficiency was one issue:
you repeat counting the number of occurrences of a word that
appears N times N times
And another (Steven D'Aprano) pointed out that your countWords will
be inefficient, but probably tolerable for data sets under about
10000 words.
: frequencies and it has been running for over half an hour without
: having generated the output file yet. I'm using a pretty powerful
: computer (core i7 with 8GB of RAM) so I'm a little surprised (and
: a bit worried as well) that the process hasn't finished yet. I
: tested the script before with a much smaller file and the output
: was as desired.
[your comment, snipped in here, out of order]
: However, even with countWords2, which is supposed to overcome this
: problem, it feels as if I've entered an infinite loop.
You have a 20MB file and 8GB of RAM, and it has taken half an hour?
You have entered an infinite loop (or some other horrible thing).
First, I would say that you don't have to worry too much about
efficiency, for your first draft, just work on correctness of
result. My machine is not as snappy as yours and finishes the job
in the sloppiest way in well under 1 second.
However, once you have solved the problem and feel good about the
correctness, then learning how to process files/data efficiently
would be a step in the direction of processing the same problem on a
20GB or 20TB file.
: When I look at the current processes running on my computer, I
: see the Python process taking 100% of the CPU. Since my computer
: has a multi-core processor, I'm assuming this process is using
: only one of the cores because another monitor tells me that the
: CPU usage is under 20%.
Correct.
: This doesn't make much sense to me. I bought a computer with a
: powerful CPU precisely to do these kinds of things as fast as
: possible. How can it be that Python is only using such a small
: amount of processing power?
<digression>
This is far afield from the question of word count, but may be
useful someday.
The beauty of a multiple processors is that you can run independent
processes simultaneously (I'm not talking about multitasking).
Using most languages(*), a single process will only use one of your
available processors. Obviously, this was once a significant
limitation, and so we came up with the idea of threads as a way to
take advantage of multiple processors inside a single process.
Python supports threads. An application must be written to take
advantage of threading. I do not think I would recommend that for
you until you have eked out as much performance as you can using a
process based model. Here are the very first links I can find which
delve into the topic, though there's much more there:
http://docs.python.org/library/threading.html
http://www.devshed.com/c/a/Python/Basic-Threading-in-Python/
http://www.dabeaz.com/python/GIL.pdf
</digression>
OK, on to your code.
: def countWords(wordlist):
: word_table = {}
: for word in wordlist:
: count = wordlist.count(word)
: print "word_table[%s] = %s" % (word,word_table.get(word,'<none>'))
: word_table[word] = count
Problem 1: You aren't returning anything from this function.
Add:
return word_table
Problem 2: You are doing more work than you need. Peter Otten
pointed out how. To see what he was observing, try this riff on
your function:
def countWords(wordlist):
word_table = dict()
for word in wordlist:
count = wordlist.count(word)
print "word_table[%s] = %s\n" % (word,word_table.get(word,'<none>'))
word_table[word] = count
return word_table
What you should see is evidence that the second and third times that
you iterate over the word 'the', you are updating an entry in the
'word_table' dictionary that already exists with the correct value.
: def countWords2(wordlist): #as proposed by Peter Otten
: word_table = {}
: for word in wordlist:
: if word in word_table:
: word_table[word] += 1
: else:
: word_table[word] = 1
: count = wordlist.count(word)
: word_table[word] = count
: return sorted(
: word_table.items(), key=lambda item: item[1], reverse=True
: )
In the above, countWords2, why not omit these lines:
: count = wordlist.count(word)
: word_table[word] = count
And, try the function again.
Let try a (bit of a labored) analogy of your problem. To
approximate your algorithm.
I have a clear tube with gumballs of a variety of colors.
I open up one end of the tube, and mark where I'm starting.
I pull out a gumball, and discover it's red.
I mark down on my chalkboard that I have 1 red gumball.
I count all of the red gumballs I can see in the clear tube.
I mark down that I have 5 red gumballs.
I put that gumball back in the other end of the tube.
I take the second gumball from the tube and discover it's green.
I mark down that I have 1 green gumball.
I count all of the green gumballs I can see in the clear tube.
I mark down that I have 1 green gumball.
I put that gumball back in the other end of the tube.
I pull out a gumball, and discover it's red.
I mark down on my chalkboard that I have 1 red gumball.
I count all of the red gumballs I can see in the clear tube.
I mark down that I have 5 red gumballs.
I put that gumball back in the other end of the tube.
...
Do you see the duplicated effort? I would also suggest re-reading
both the mail from Peter Otten and Steven D'Aprano. You appear to
be much closer than earlier, but both of them are pointing out
something that you don't appear to quite have grasped yet.
As a side note, you may find it useful to simply play with some
lists and dictionaries in the python interpreter:
>>> words = 'in the garden on the bank behind the tree'.split()
>>> words
['in', 'the', 'garden', 'on', 'the', 'bank', 'behind', 'the', 'tree']
>>> word_table = dict()
>>> w = words[0]
>>> word_table[w] = words.count(w)
>>> word_table
{'in': 1}
>>> w = words[1]
>>> word_table[w] = words.count(w)
>>> word_table
{'the': 3, 'in': 1}
Once you gain familiarity with the lists and dicts, you can try out
collections, as suggested by Peter Otten.
Enjoy and good luck!
-Martin
* Somebody will be certain to point out a language or languages
that provide some sort of facility to abstract the use of
multiple processors without the explicit use of threads.
--
Martin A. Brown
http://linux-ip.net/
More information about the Tutor
mailing list