[Tutor] count words

Tue Feb 15 20:39:30 CET 2005

Ryan Davis wrote:
> Here's one way to iterate over that to get the counts.  I'm sure there are dozens.
> ###
> 
>>>>x = 'asdf foo bar foo'
>>>>counts = {}
>>>>for word in x.split():
> 
> ...	counts[word] = x.count(word)
> ... 
> 
>>>>counts
> 
> {'foo': 2, 'bar': 1, 'asdf': 1}
> ###
> The dictionary takes care of duplicates.  If you are using a really big file, it might pay to eliminate duplicates from the list
> before running x.count(word)

Be wary of using the count() function for this, it can be very slow. The problem is that every time 
you call count(), Python has to look at every element of the list to see if it matches the word you 
passed to count(). So if the list has n words in it, you will make n*n comparisons. In contrast, the 
method that directly accumulates counts in a dictionary just makes one pass over the list. For small 
lists this doesn't matter much, but for a longer list you will definitely see the difference.

For example, I downloaded "The Art of War" from Project Gutenberg 
(http://www.gutenberg.org/dirs/1/3/132/132a.txt) and tried both methods. Here is a program that 
times how long it takes to do the counts using two different methods:

######### WordCountTest.py
''' Count words two different ways '''

from string import punctuation
from time import time

def countWithDict(words):
     ''' Word count by accumulating counts in a dictionary '''
     counts = {}
     for word in words:
         counts[word] = counts.get(word, 0) + 1

     return counts

def countWithCount(words):
     ''' Word count by calling count() for each word '''
     counts = {}
     for word in words:
         counts[word] = words.count(word)

     return counts

def timeOne(f, words):
     ''' Time how long it takes to do f(words) '''
     startTime = time()
     f(words)
     endTime = time()
     print '%s: %f' % (f.__name__, endTime-startTime)

# Get the word list and strip off punctuation
words = open(r'D:\Personal\Tutor\ArtOfWar.txt').read().split()
words = [ word.strip(punctuation) for word in words ]

# How many words is it, anyway?
print len(words), 'words'

# Time the word counts
c1 = timeOne(countWithDict, words)
c2 = timeOne(countWithCount, words)

# Check that both get the same result
assert c1 == c2

####################

The results (times are in seconds):
14253 words
countWithDict: 0.010000
countWithCount: 9.183000

It takes 900 times longer to count() each word individually!

Kent