[Tutor] count words
Kent Johnson
kent37 at tds.net
Tue Feb 15 20:39:30 CET 2005
Ryan Davis wrote:
> Here's one way to iterate over that to get the counts. I'm sure there are dozens.
> ###
>
>>>>x = 'asdf foo bar foo'
>>>>counts = {}
>>>>for word in x.split():
>
> ... counts[word] = x.count(word)
> ...
>
>>>>counts
>
> {'foo': 2, 'bar': 1, 'asdf': 1}
> ###
> The dictionary takes care of duplicates. If you are using a really big file, it might pay to eliminate duplicates from the list
> before running x.count(word)
Be wary of using the count() function for this, it can be very slow. The problem is that every time
you call count(), Python has to look at every element of the list to see if it matches the word you
passed to count(). So if the list has n words in it, you will make n*n comparisons. In contrast, the
method that directly accumulates counts in a dictionary just makes one pass over the list. For small
lists this doesn't matter much, but for a longer list you will definitely see the difference.
For example, I downloaded "The Art of War" from Project Gutenberg
(http://www.gutenberg.org/dirs/1/3/132/132a.txt) and tried both methods. Here is a program that
times how long it takes to do the counts using two different methods:
######### WordCountTest.py
''' Count words two different ways '''
from string import punctuation
from time import time
def countWithDict(words):
''' Word count by accumulating counts in a dictionary '''
counts = {}
for word in words:
counts[word] = counts.get(word, 0) + 1
return counts
def countWithCount(words):
''' Word count by calling count() for each word '''
counts = {}
for word in words:
counts[word] = words.count(word)
return counts
def timeOne(f, words):
''' Time how long it takes to do f(words) '''
startTime = time()
f(words)
endTime = time()
print '%s: %f' % (f.__name__, endTime-startTime)
# Get the word list and strip off punctuation
words = open(r'D:\Personal\Tutor\ArtOfWar.txt').read().split()
words = [ word.strip(punctuation) for word in words ]
# How many words is it, anyway?
print len(words), 'words'
# Time the word counts
c1 = timeOne(countWithDict, words)
c2 = timeOne(countWithCount, words)
# Check that both get the same result
assert c1 == c2
####################
The results (times are in seconds):
14253 words
countWithDict: 0.010000
countWithCount: 9.183000
It takes 900 times longer to count() each word individually!
Kent
More information about the Tutor
mailing list