An IMHO interesting optimization question...

Wed Apr 24 22:00:09 EDT 2002

On 25 Apr 2002 01:15:14 +0200, Michael 'Mickey' Lauer <mickey at tm.informatik.uni-frankfurt.de> wrote:

>Hello,
>
>I have an optimiziation question. Given the following code
>which generates a word length statistic up to words of the length
><longWord>.
>
>=-------------------------------------------------=
>#!/usr/bin/python
>
>def count( word ):
>   "Increments the wordcounter"
>   length = len( word )
>   if length > longWord:
>       length = longWord
>   stats[length] += 1
>
>def printout():
>    "Formatted output of dictionary"
>    for i in xrange( 1, longWord ):
>        print "Length %d = #%d" % ( i, stats[i] )
>    print "Length >%d = #%d" % ( longWord, stats[longWord] )
>
># import libraries and check for help request
>
>import sys, re
>if len( sys.argv ) != 1:
>    print """
>wordStatistic.py 1.0
>--------------------
>Reads from stdin and prints a word statistic.
>Usage: cat <file> | wordStatistic.py
>"""
>    sys.exit( 0 )
>
>longWord = 15
>defWord = "[A-Za-z]+"
>
>stats = {}
>wordExp = re.compile( defWord )
>
># initialize dictionary, so we don't have to check for key-existance later
>for i in xrange( 1, longWord+1 ):
>    stats[i] = 0
>
># main loop. iterates over standard input as long as there's
># something coming in...
>while 1:
>    line = sys.stdin.readline()
>    if line:
>        for word in wordExp.findall( line ):
>            count( word )
>    else:
>        break
>=-------------------------------------------------=
>
>For a given document, this program takes 20 seconds on my machine.
>I now try to optimize it by eliminating the while loop. The 2nd version looks like:
>
>=-------------------------------------------------=
># main loop. iterates over standard input as long as there's
># something coming in...
>lines = sys.stdin.readlines()
>for line in lines:
>    for word in wordExp.findall( line ):
>        count( word )
>=-------------------------------------------------=
>
>This takes 14 seconds on my machine which seems to be quite an achievement.
>
>Now on to the next. I try to eliminate the for loop by substituting with a map statement,
>which is supposed to be faster (because it's looping in C). The 3rd version looks like:
>
>=-------------------------------------------------=
># main loop. iterates over standard input as long as there's
># something coming in...
>lines = sys.stdin.readlines()
>for line in lines:
>    map( count, wordExp.findall( line ) )
>=-------------------------------------------------=
>
>Now this takes 17 seconds on my machine. Wow? Why that? Why does it take longer?
>
>On to the final version, which transforms the outer loop into a map statement:
>
>=-------------------------------------------------=
># main loop. iterates over standard input as long as there's
># something coming in...
>map( lambda line: map( count, wordExp.findall( line ) ), sys.stdin.readlines() )
>=-------------------------------------------------=
>
>This is even worse - it takes 18 seconds on my computer. What am I missing? And do you
>know a version which would be even faster than the 2nd version?
>
For starters, I would put the loop inside a function, and use a local list and index it with
the integer lengths instead of using a globally referenced dictionary. Why hash values that
can only be 1..15? And I would do the count logic right in the loop, to avoid the function
call overhead. When the loop is done, return the list of counts.

I have to leave, but I'm sure there's more speedup possible. How big is your file?

Regards,
Bengt Richter