Histogram (was: When Python outruns C++)
David Mertz
mertz at gnosis.cx
Tue Apr 1 00:48:07 EST 2003
Julian Tibble <chasm at rift.sytes.net> wrote previously:
...histogram program...
|file of size ~80MB, the C program outperformed the python one
|$ (time ./words < biginput) > /dev/null # 0m8.476s
|$ (time python words.py < biginput) > /dev/null # 1m59.747s
A difference this large surprises me. I figured there had to be
something wrong with Tibble's Python version, but it seems it's
not that bad. The fastest Python histogram program I know of is
below (the example has come up before :-))... but in a quick test
against a smaller 5MB file, it still takes over half the time of
Tibble's. FWIW, the version I present in my book,
<http://gnosis.cx/TPiP/chap2.txt>, falls between the two in
speed, but mine has a few more options. I tested under 2.3a2.
------------------------------------------------------------------------
#!/usr/local/bin/python
# $Id: wordfreq.python,v 1.9 2001/05/11 17:44:00 doug Exp $
# http://www.bagley.org/~doug/shootout/
#
# adapted from Bill Lear's original python word frequency counter
#
# Joel Rosdahl suggested using translate table to speed up
# word splitting. That change alone sped this program up by
# at least a factor of 3.
#
# with further speedups from Mark Baker
import sys

def main():
    count = {}
    i_r = map(chr, range(256))

    trans = [' '] * 256
    o_a, o_z = ord('a'), (ord('z')+1)
    trans[ord('A'):(ord('Z')+1)] = i_r[o_a:o_z]
    trans[o_a:o_z] = i_r[o_a:o_z]
    trans = ''.join(trans)

    rl = sys.stdin.readlines

    lines = rl(4095)
    while lines:
        for line in lines:
            for word in line.translate(trans).split():
                try:
                    count[word] += 1
                except KeyError:
                    count[word] = 1
        lines = rl(4095)

    l = zip(count.values(), count.keys())
    l.sort()
    l.reverse()

    print '\n'.join(["%7s\t%s" % (count, word) for (count, word) in l])

import time
start = time.clock()
main()
sys.stderr.write('%4.2f seconds\n' % (time.clock()-start))
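
For anyone trying this on Python 3: the listing above is Python 2
only (print statement, list-returning map and zip, time.clock).
What follows is a minimal sketch of the same translate-table idea
for Python 3. It is an adaptation, not the shootout program, and it
has not been timed. The names TABLE, build_table, count_words and
format_histogram are made up for illustration, and it assumes the
input is read as bytes so that a 256-entry table still applies.

```python
from collections import Counter

def build_table():
    # 256-entry table: every byte becomes a space, except ASCII
    # letters, which are kept and folded to lowercase.
    table = bytearray(b' ' * 256)
    for c in range(ord('a'), ord('z') + 1):
        table[c] = c                      # lowercase maps to itself
        table[c - 32] = c                 # uppercase maps to lowercase
    return bytes(table)

TABLE = build_table()

def count_words(stream, hint=4095):
    # stream is a binary file object; readlines(hint) returns a
    # batch of whole lines, as in the loop above.
    counts = Counter()
    lines = stream.readlines(hint)
    while lines:
        for line in lines:
            counts.update(line.translate(TABLE).split())
        lines = stream.readlines(hint)
    return counts

def format_histogram(counts):
    # Same ordering as zip/sort/reverse: descending count, then
    # descending word.
    pairs = sorted(((n, w) for w, n in counts.items()), reverse=True)
    return '\n'.join('%7d\t%s' % (n, w.decode('ascii')) for n, w in pairs)
```

To run it as a filter, pass the binary stdin stream, e.g.
print(format_histogram(count_words(sys.stdin.buffer))) after an
import sys.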
--
mertz@ _/_/_/_/_/_/_/ THIS MESSAGE WAS BROUGHT TO YOU BY:_/_/_/_/ v i
gnosis _/_/ Postmodern Enterprises _/_/ s r
.cx _/_/ MAKERS OF CHAOS.... _/_/ i u
_/_/_/_/_/ LOOK FOR IT IN A NEIGHBORHOOD NEAR YOU_/_/_/_/_/ g s