[Spambayes] some preliminary timings
Skip Montanaro
skip at pobox.com
Mon Feb 24 16:36:03 EST 2003
I just ran some simple(-minded) performance tests on my machine (Apple
TiPowerbook, 800MHz, Mac OS 10.2.4).
Using a simple filter loop from the interpreter:
>>> from spambayes import mboxutils
>>> fp = open("/Users/skip/tmp/newham.clean.save")
>>> mbox = mailbox.PortableUnixMailbox (fp, mboxutils.get_message )
>>> i = 0
>>> t = time.clock()
>>> for msg in mbox:
... x = h.filter(msg)
... i += 1
...
t = time.clock()-t
>>> t = time.clock()-t
>>> i
13389
>>> t
291.63
>>> i/t
45.910914514967594
You can see I was able to filter roughly 46 messages per cpu second, or 0.02
cpu seconds per message. I then wrote the first ten messages of the above
mailbox to files:
>>> fp = open("/Users/skip/tmp/newham.clean.save")
>>> msgs = []
>>> mbox = mailbox.PortableUnixMailbox (fp, mboxutils.get_message )
>>> msgs.append(mbox.next())
>>> msgs.append(mbox.next())
>>> msgs.append(mbox.next())
>>> msgs.append(mbox.next())
>>> msgs.append(mbox.next())
>>> msgs.append(mbox.next())
>>> msgs.append(mbox.next())
>>> msgs.append(mbox.next())
>>> msgs.append(mbox.next())
>>> msgs.append(mbox.next())
>>> i = 1
>>> for m in msgs:
... f = open("msg%02d" % i, "w")
... f.write(m.as_string())
... i += 1
...
ran a simple hammiefilter loop from the shell:
% for f in msg?? ; do
> time hammiefilter.py -d ~/hammie.db < $f > /dev/null
> done
and got
real 0m0.464s
user 0m0.240s
sys 0m0.170s
real 0m0.441s
user 0m0.260s
sys 0m0.130s
real 0m0.442s
user 0m0.240s
sys 0m0.150s
real 0m0.443s
user 0m0.300s
sys 0m0.100s
real 0m0.580s
user 0m0.370s
sys 0m0.150s
real 0m0.535s
user 0m0.370s
sys 0m0.090s
real 0m0.501s
user 0m0.280s
sys 0m0.140s
real 0m0.504s
user 0m0.340s
sys 0m0.110s
real 0m0.638s
user 0m0.410s
sys 0m0.180s
real 0m0.450s
user 0m0.290s
sys 0m0.100s
for an average wallclock time of 0.5 seconds per message. Considering just
user+sys times (to more accurately compare what time.clock() returns)
brings the per-message time down to 0.44 seconds.
Such a huge difference between hammiefilter and a raw filter loop suggests I
may have done something wrong. Still, perhaps opening the db file for each
message and all the imports hammiefilter has to do simply kills the
performance.
In an attempt to minimze the effect of byte compiling hammiefilter.py each
time, I imported it once from the interpreter, then changed the loop to
% for f in msg?? ; do
> time python hammiefilter.pyc -d ~/hammie.db < $f > /dev/null
> done
This generated
real 0m0.474s
user 0m0.230s
sys 0m0.160s
real 0m0.468s
user 0m0.240s
sys 0m0.150s
real 0m0.465s
user 0m0.250s
sys 0m0.100s
real 0m0.460s
user 0m0.270s
sys 0m0.110s
real 0m0.602s
user 0m0.370s
sys 0m0.160s
real 0m0.556s
user 0m0.340s
sys 0m0.120s
real 0m0.518s
user 0m0.320s
sys 0m0.110s
real 0m0.522s
user 0m0.310s
sys 0m0.110s
real 0m0.652s
user 0m0.410s
sys 0m0.150s
real 0m0.473s
user 0m0.300s
sys 0m0.100s
for an average wallclock time of 0.52 seconds and average user+sys time of
0.43 seconds, that is, not really much of an improvement at all. This was
with a 2.3a1+ version of Python.
Skip
More information about the Spambayes
mailing list