[Spambayes] some preliminary timings

Skip Montanaro skip at pobox.com
Mon Feb 24 16:36:03 EST 2003


I just ran some simple(-minded) performance tests on my machine (Apple
TiPowerbook, 800MHz, Mac OS 10.2.4).

Using a simple filter loop from the interpreter:

    >>> from spambayes import mboxutils
    >>> fp = open("/Users/skip/tmp/newham.clean.save")
    >>> mbox = mailbox.PortableUnixMailbox (fp, mboxutils.get_message )
    >>> i = 0
    >>> t = time.clock()
    >>> for msg in mbox:
    ...   x = h.filter(msg)
    ...   i += 1
    ... 
    t = time.clock()-t
    >>> t = time.clock()-t
    >>> i
    13389
    >>> t
    291.63
    >>> i/t
    45.910914514967594

You can see I was able to filter roughly 46 messages per cpu second, or 0.02
cpu seconds per message.  I then wrote the first ten messages of the above
mailbox to files:

    >>> fp = open("/Users/skip/tmp/newham.clean.save")
    >>> msgs = []
    >>> mbox = mailbox.PortableUnixMailbox (fp, mboxutils.get_message )
    >>> msgs.append(mbox.next())
    >>> msgs.append(mbox.next())
    >>> msgs.append(mbox.next())
    >>> msgs.append(mbox.next())
    >>> msgs.append(mbox.next())
    >>> msgs.append(mbox.next())
    >>> msgs.append(mbox.next())
    >>> msgs.append(mbox.next())
    >>> msgs.append(mbox.next())
    >>> msgs.append(mbox.next())
    >>> i = 1
    >>> for m in msgs:
    ...   f = open("msg%02d" % i, "w")
    ...   f.write(m.as_string())
    ...   i += 1
    ... 

ran a simple hammiefilter loop from the shell:

    % for f in msg?? ; do
    > time hammiefilter.py -d ~/hammie.db < $f > /dev/null
    > done

and got 

    real    0m0.464s
    user    0m0.240s
    sys     0m0.170s

    real    0m0.441s
    user    0m0.260s
    sys     0m0.130s

    real    0m0.442s
    user    0m0.240s
    sys     0m0.150s

    real    0m0.443s
    user    0m0.300s
    sys     0m0.100s

    real    0m0.580s
    user    0m0.370s
    sys     0m0.150s

    real    0m0.535s
    user    0m0.370s
    sys     0m0.090s

    real    0m0.501s
    user    0m0.280s
    sys     0m0.140s

    real    0m0.504s
    user    0m0.340s
    sys     0m0.110s

    real    0m0.638s
    user    0m0.410s
    sys     0m0.180s

    real    0m0.450s
    user    0m0.290s
    sys     0m0.100s

for an average wallclock time of 0.5 seconds per message.  Considering just
user+sys times (to more accurately compare what time.clock() returns)
brings the per-message time down to 0.44 seconds.

Such a huge difference between hammiefilter and a raw filter loop suggests I
may have done something wrong.  Still, perhaps opening the db file for each
message and all the imports hammiefilter has to do simply kills the
performance.

In an attempt to minimze the effect of byte compiling hammiefilter.py each
time, I imported it once from the interpreter, then changed the loop to

    % for f in msg?? ; do
    > time python hammiefilter.pyc -d ~/hammie.db < $f > /dev/null
    > done

This generated

    real    0m0.474s
    user    0m0.230s
    sys     0m0.160s

    real    0m0.468s
    user    0m0.240s
    sys     0m0.150s

    real    0m0.465s
    user    0m0.250s
    sys     0m0.100s

    real    0m0.460s
    user    0m0.270s
    sys     0m0.110s

    real    0m0.602s
    user    0m0.370s
    sys     0m0.160s

    real    0m0.556s
    user    0m0.340s
    sys     0m0.120s

    real    0m0.518s
    user    0m0.320s
    sys     0m0.110s

    real    0m0.522s
    user    0m0.310s
    sys     0m0.110s

    real    0m0.652s
    user    0m0.410s
    sys     0m0.150s

    real    0m0.473s
    user    0m0.300s
    sys     0m0.100s

for an average wallclock time of 0.52 seconds and average user+sys time of
0.43 seconds, that is, not really much of an improvement at all.  This was
with a 2.3a1+ version of Python.

Skip



More information about the Spambayes mailing list