[Tutor] Simple counter to determine frequencies of words in a document

Sun Nov 21 17:15:28 CET 2010

Martin, Alan, col speed and everybody else who helped: I think I'm going
to stop because I'm repeating myself, but it is difficult for me not to
be profuse in my thanks because you guys really go beyond the call of
duty. I love this list. Most of the time the responses on this list
don't just address the problem at hand but are also useful in a more
general sense and help people become better programmers. So, thanks
for all the good advice as well as for helping me solve the particular
problem I had.

Let me address some particular points you've made:

On Sun, Nov 21, 2010 at 12:01 AM, Martin A. Brown <martin at linux-ip.net> wrote:

>  : It turns out that matters of efficiency appear to be VERY
>  : important in this case. The example in my message was a very
<snip>
> Efficiency is best addressed first and foremost, not by hardware,
> but by choosing the correct data structure and algorithm for
> processing the data.  You have more than enough hardware to deal
> with this problem,

Yes indeed. Now that I fixed the code following your advice and
Alan's, it took a few seconds for the script to run and yield the
desired results. Big sigh of relief: my investment in a powerful
computer was not in vain.

<snip>
> This is far afield from the question of word count, but may be
> useful someday.
>
> The beauty of multiple processors is that you can run independent
> processes simultaneously (I'm not talking about multitasking).
<snip>
>  http://docs.python.org/library/threading.html
>  http://www.devshed.com/c/a/Python/Basic-Threading-in-Python/
>  http://www.dabeaz.com/python/GIL.pdf

VERY useful information, thanks!

> OK, on to your code.
>
>  : def countWords(wordlist):
>  :     word_table = {}
>  :     for word in wordlist:
>  :         count = wordlist.count(word)
>  :         print "word_table[%s] = %s" % (word,word_table.get(word,'<none>'))
>  :         word_table[word] = count
>
> Problem 1:  You aren't returning anything from this function.
>  Add:
>       return word_table

Sorry, I had a lot of comments in my code (I'm learning and I
want to document profusely everything I do so that I don't have to
reinvent the wheel every time I try to do something), so before
posting it here I did a lot of deleting. Unintentionally I deleted the
following line (suggested in Steve's original message) that contained
the return:

 return sorted(word_table.items(), key=lambda item: item[1], reverse=True)

Even adding this, though, the process was taking too long and I had to
kill it. When I fixed my mistake in Peter Otten's code (see below)
everything worked like a charm.

By the way, I know what a lambda function is and I read about the key
parameter in sorted(), but I don't understand very well what
"key=lambda item: item[1]" does. It has to do with taking the value
'1' as the term for comparison, I guess, since this returns a list
ordered by the number of times a word appears in the text, going
from the most frequent to the least frequent, and reverse=True is what
changes the order in which it is sorted. What I don't understand is the
syntax of "item: item[1]".
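
On the syntax question, "lambda item: item[1]" is an anonymous function
that takes one argument, here called item, and returns item[1]. Because
word_table.items() yields (word, count) pairs, item[1] is the count, so
sorted() orders the pairs by count rather than by the literal value 1.
A minimal sketch follows; the sample dictionary and the helper name
second_element are made up for illustration and are not from the thread:

```python
# Hypothetical sample data, invented for this illustration.
word_table = {"the": 5, "cat": 2, "sat": 3}

def second_element(item):
    # item is a (word, count) tuple; index 1 is the count.
    return item[1]

# The lambda and the named function do exactly the same job:
# sorted() calls the key function once per pair and compares the results.
by_lambda = sorted(word_table.items(), key=lambda item: item[1], reverse=True)
by_named = sorted(word_table.items(), key=second_element, reverse=True)
```

Both calls give the same list, with the highest count first. The standard
library's operator.itemgetter(1) is another common way to write this key.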
<snip>

>  : def countWords2(wordlist): #as proposed by Peter Otten
>  :     word_table = {}
>  :     for word in wordlist:
>  :         if word in word_table:
>  :             word_table[word] += 1
>  :         else:
>  :             word_table[word] = 1
>  :         count = wordlist.count(word)
>  :         word_table[word] = count
>  :     return sorted(
>  :                   word_table.items(), key=lambda item: item[1], reverse=True
>  :                   )
>
> In the above, countWords2, why not omit these lines:
>
>  :         count = wordlist.count(word)
>  :         word_table[word] = count

Sorry, this was my mistake and it is what was responsible for the
script hanging. This is the problem with cutting and pasting code and
not reviewing what you copied. I took Steve's code as the basis and
tried to modify it with Peter's code, but then I forgot to delete these
two lines that were in Steve's code. Since it worked in the test I
did with the small file, I didn't even bother to check it. Live and
learn.
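
Why those two lines hurt so much: wordlist.count(word) scans the whole
list every time it is called, so calling it once per word makes the
total work grow roughly with the square of the list length. Dropping
them leaves a single pass over the list. A sketch of the function with
the two lines removed (the name count_words_fixed and the sample text
are made up for illustration):

```python
def count_words_fixed(wordlist):
    """Return (word, count) pairs, most frequent first."""
    word_table = {}
    for word in wordlist:
        # One dictionary lookup per word; no rescan of the whole list.
        if word in word_table:
            word_table[word] += 1
        else:
            word_table[word] = 1
    return sorted(word_table.items(), key=lambda item: item[1], reverse=True)

# Hypothetical sample input, invented for this illustration.
sample = "a b a c a b".split()
result = count_words_fixed(sample)
```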
<snip>

> Let me try a (bit of a labored) analogy of your problem, to
> approximate your algorithm.
>
>  I have a clear tube with gumballs of a variety of colors.
>  I open up one end of the tube, and mark where I'm starting.

<snip>

This is what I meant at the beginning. This little analogy was
pedagogically very sound. Thanks! I really appreciate your time (and I
hope others will as well).

<snip>
> Once you gain familiarity with the lists and dicts, you can try out
> collections, as suggested by Peter Otten.

The problem is that I'm using version 2.6.1. I have a Mac and I am
using a package called NLTK to process natural language. I tried to
install newer versions of Python on the Mac but the result was a mess.
The NLTK modules worked well with the default Python installation
but not with the newer versions I installed. The recommendation is not
to delete the default version of Python on the Mac because it might be
used by the system or some applications. So I had to go back to the
Python version that comes installed by default on the Mac.
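
A note on the version constraint: the thread excerpt does not say which
class from collections was meant. If it was collections.Counter, that
class only arrived in Python 2.7, so it is not available on 2.6.1.
collections.defaultdict, however, has been in the standard library
since Python 2.5. A minimal sketch, assuming defaultdict is an
acceptable substitute (the function name count_words_dd and the sample
text are made up for illustration):

```python
from collections import defaultdict

def count_words_dd(wordlist):
    """Count words with a defaultdict; works on Python 2.5 and later."""
    # defaultdict(int) supplies 0 for a missing key, so no "if word in ..."
    # test is needed before incrementing.
    word_table = defaultdict(int)
    for word in wordlist:
        word_table[word] += 1
    return sorted(word_table.items(), key=lambda item: item[1], reverse=True)

# Hypothetical sample input, invented for this illustration.
counts = count_words_dd("a b a c a b".split())
```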

Josep M.