Histogram (was: When Python outruns C++)

David Mertz mertz at gnosis.cx
Tue Apr 1 00:48:07 EST 2003


Julian Tibble <chasm at rift.sytes.net> wrote previously:
...histogram program...
|file of size ~80MB, the C program outperformed the python one
|$ (time ./words < biginput) > /dev/null            #  0m8.476s
|$ (time python words.py < biginput) > /dev/null    # 1m59.747s

This much of a difference surprises me.  I figured there had to
be something wrong with Tibble's Python version, but it turns out
not to be that bad.  The fastest Python histogram program I know
of is below (the example has come up before :-)), yet in a quick
test against a smaller 5MB file it still takes over half the
time of Tibble's.  FWIW, the version I present in my book,
<http://gnosis.cx/TPiP/chap2.txt>, falls between the two in
speed, but mine has a few more options.  I tested under 2.3a2.

------------------------------------------------------------------------
#!/usr/local/bin/python
# $Id: wordfreq.python,v 1.9 2001/05/11 17:44:00 doug Exp $
# http://www.bagley.org/~doug/shootout/
#
# adapted from Bill Lear's original python word frequency counter
#
# Joel Rosdahl suggested using translate table to speed up
# word splitting.  That change alone sped this program up by
# at least a factor of 3.
#
# with further speedups from Mark Baker

import sys

def main():
    count = {}
    i_r = map(chr, range(256))

    trans = [' '] * 256
    o_a, o_z = ord('a'), (ord('z')+1)
    trans[ord('A'):(ord('Z')+1)] = i_r[o_a:o_z]
    trans[o_a:o_z] = i_r[o_a:o_z]
    trans = ''.join(trans)

    rl = sys.stdin.readlines

    lines = rl(4095)
    while lines:
        for line in lines:
            for word in line.translate(trans).split():
                try:
                    count[word] += 1
                except KeyError:
                    count[word] = 1
        lines = rl(4095)

    l = zip(count.values(), count.keys())
    l.sort()
    l.reverse()

    print '\n'.join(["%7s\t%s" % (count, word) for (count, word) in l])

import time
start = time.clock()
main()
sys.stderr.write('%4.2f seconds\n' % (time.clock()-start))
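
The listing above is Python 2 as written.  For anyone who wants to
try the same translate-table trick on a Python 3 interpreter, here
is a rough sketch, not a drop-in port.  The names TABLE,
count_words, report and main are mine, not from the shootout code.
It works on bytes rather than str so that a 256-entry table still
applies, and it swaps collections.Counter in for the try/except
dictionary update.  I make no claim about how its speed compares.

```python
import sys
from collections import Counter


def build_table():
    # 256-entry table: every byte becomes a space, except ASCII
    # letters, which are kept and folded to lowercase.
    table = bytearray(b' ' * 256)
    for c in range(ord('a'), ord('z') + 1):
        table[c] = c                  # lowercase maps to itself
        table[c - 32] = c             # uppercase maps to lowercase
    return bytes(table)


TABLE = build_table()


def count_words(lines):
    # 'lines' is any iterable of bytes objects, e.g. sys.stdin.buffer.
    counts = Counter()
    for line in lines:
        counts.update(line.translate(TABLE).split())
    return counts


def report(counts):
    # Same ordering as sort() + reverse() on (count, word) pairs.
    pairs = sorted(((n, w.decode('ascii')) for w, n in counts.items()),
                   reverse=True)
    return '\n'.join('%7d\t%s' % pair for pair in pairs)


def main():
    # Read raw bytes from stdin, print the histogram to stdout.
    print(report(count_words(sys.stdin.buffer)))
```

To use it as a filter like the original, call main() at the bottom
of the file and run it as `python3 words3.py < biginput` (the
script name is just a placeholder).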


--
 mertz@   _/_/_/_/_/_/_/ THIS MESSAGE WAS BROUGHT TO YOU BY:_/_/_/_/ v i
gnosis  _/_/                    Postmodern Enterprises         _/_/  s r
.cx    _/_/  MAKERS OF CHAOS....                              _/_/   i u
      _/_/_/_/_/ LOOK FOR IT IN A NEIGHBORHOOD NEAR YOU_/_/_/_/_/    g s





