[Tutor] Word count help

D-Man dsh8290@rit.edu
Thu, 1 Feb 2001 19:15:47 -0500

I have looked through the code you posted, and I think I know why it
is so slow.  You use lists a lot in your code, and use function calls
such as list.count().  AFAIK list.count() will traverse through the
entire list every time it is called.  When the list gets to be large,
this will take a long time.

To speed this up, I would recommend using a dictionary to hold all the
words and their counts.  Dictionary lookups are fast since they use a
hashing function to find the key.

In this dictionary, I would use the word as the key, and have the
associated value be the number of occurences.

(I posted a perl script I wrote for class last quarter, but I think a
moderator dropped the message)

I have now translated that perl script to python.  I use more pythonic
techniques, and also use more punctuation marks for delimiting words
(the result is that python does more work because it finds more
words).  (In the perl script I made my own regex with the most common
chars, in the python script I use the predefined constants in the
string module).

For timing, I first prepared a file to count (the bash manpage):

$ man bash > man_bash
$ cp man_bash dup
$ cat dup >> man_bash
(3 times)
$ rm dup

The file size is now 1,211KB as reported by Windows Explorer.  The
file has 24552 lines in it.  I'm using a PII 400 (I might be a little
off on the clock speed, it's not my personal machinge) with Win2k and
cygwin.  I ran the interpreter from the bash shell.  The python
interpreter is compiled for windows, the perl one came with cygwin
(compiled for cygwin). 

I ran 

time perl count.pl < man_bash > /dev/null
time python count.py < man_bash > /dev/null

to get a rough idea how long the scripts take to run.

My results:

real    0m9.233s
user    0m8.522s
sys     0m0.090s

real    0m19.779s
user    0m0.010s
sys     0m0.010s

Just kind of interesting, the first time I timed it, I used the wrong
filename (a file that doesn't exist).  Perl took longer to report the
error than python.

bash: man_bashrc: No such file or directory
real    0m0.050s
user    0m0.030s
sys     0m0.020s

bash: man_bashrc: No such file or directory
real    0m0.010s
user    0m0.000s
sys     0m0.010s

My entire script (except for the print statements) could be placed
inside a function that takes a filename as an argument, and returns
the dictionary of words if you wanted to use it as part of a larger

I hope this helps you to learn more.  If you have any questions, don't
hesitate to ask.


Here is the script that I used:

import sys
import string
import re

# initialize the counters
linecount = 0
charcount = 0
wordcount = 0
# this is where I will store the total count of each word
words     = { }

# iterate over each line on stdin
for line in sys.stdin.readlines() :
    linecount += 1
    charcount += len( line )

    # remove leading and trailing whitespace
    line = string.strip( line ) 

    # split the string into a list of words
    # a word is delimited by whitespace or punctuation
            #"[.,:;?! \t\n]+" , # this is the regex used in my perl version
    for word in re.split(
            "[" + string.whitespace + string.punctuation + "]+" ,
            line ) :

        # make the word lower case
        word = string.lower( word )

        # check to make sure the string is considered a word
        if re.match( "^[" + string.lowercase + "]+$" , word ) :
            wordcount += 1

            # if the word has been found before, increment its count
            # otherwise initialize its count to 1
            if words.has_key( word ) :
                words[ word ] += 1
            else :
                words[ word ] = 1

# Now print out the results of the count:
print "Number of lines:" , linecount
print "Total word count:" , wordcount
print "Total character count:" , charcount

# print each word and its count in sorted order
sorted_word_list = words.keys()

for word in sorted_word_list :
    print word , ":" , words[ word ]


The script's output from the man_bash file I used (identical to the
perl version's, except for the extra carriage returns that result from
Windows) looks like:

Number of lines: 24552
Total word count: 116000
Total character count: 1239660

a : 3236
abbreviates : 12
abe : 4
ability : 4
able : 40
abled : 4
ables : 20
abling : 4
abort : 8
about : 32
above : 196
absence : 4
absent : 4
absolute : 8

PS.  I recently learned that all Usenet posts are automatically
     copyrighted by the poster.  In light of this, I give explicit
     permission to use, copy, distribute, and mutilate my sample code
     to your heart's desire.  :-)