[spambayes-dev] SpamBayes server compliant w/ spamassassin

Skip Montanaro skip at pobox.com
Sun Apr 25 00:32:32 EDT 2004


    jkx> Where significant effort ? 

    jkx> I really miss something. Have you read the code i provided ?  It
    jkx> just serve as 1 single server (hammie filter) for a large number of
    jkx> users. But all have their own database.
    jkx> - one and only one server (not one per user !)
    jkx> - every user have its own db 

No, I admit I didn't read your code.  I read your mail message and must not
have fully understood what you were after.  My apologies.

    jkx> Do  you really want to open one UnixDomain socket per user ????? 

Sure, why not?  Unix domain sockets are pretty cheap.
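
To make "one socket per user" concrete, here is a minimal sketch in Python.
It is not taken from sb_bnserver.py; the function name, the directory layout
and the ".sock" suffix are assumptions made purely for illustration.

```python
import os
import socket


def open_user_sockets(base_dir, users):
    """Bind one listening Unix domain socket per user under base_dir.

    Hypothetical layout: <base_dir>/<user>.sock.  Returns a dict that
    maps each user name to its listening socket object.
    """
    sockets = {}
    for user in users:
        path = os.path.join(base_dir, user + ".sock")
        # Remove a stale socket file left behind by an earlier run.
        if os.path.exists(path):
            os.unlink(path)
        srv = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        srv.bind(path)
        srv.listen(5)
        sockets[user] = srv
    return sockets
```

Each socket costs one file descriptor and one filesystem entry.  A single
process could watch all of them with select(), so one socket per user need
not mean one server process per user.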

    jkx> I usually work w/ about 50 users right now ..  ( and i wrote this
    jkx> code to do on ~ 1000 accounts .. ).

    jkx> Another thing, i don't care about 'general database' .. this isn't
    jkx> the goal i want a system managable for a large number of user..

I don't think a shared database would work except for a very close group of
users (very similar ideas of what constitutes ham and spam).  How do your
users train their databases?  I presume you are doing all this on your mail
server.  Are your users local or remote?

    >> Once you have that working, you can rewrite sb_bnfilter.py in C to
    >> reduce memory consumption and maybe improve performance a bit.
    >> sb_bnserver.py could probably be sped up just by running it with
    >> psyco.

    jkx> pscyco have nothing about that. the trouble is 'exec a python' at
    jkx> each email

I don't see 'exec a python' as a huge problem.  Presumably on a busy server
the python interpreter and all the compiled bytecode will just be sitting in
memory buffers awaiting activation.  Lots of systems do the equivalent of
'exec a python' or more on a per message basis.  Have you tried it?  Was it
too slow?

    jkx> so even it the server falls for a strange raison mails aren't lost
    jkx> .. (spamc do that perfectly )

I'd rather trust my mail's delivery to procmail.  If sb_bn*.py craps out,
procmail is there to recover the message for me.  So far that combination
has been very robust.  It processes between 2,000 and 3,000 messages daily
(about 70% spam) for me on my laptop without a hiccup.  I generally don't
even notice that it's running.

I just ran a quick test of sb_bnfilter.py on my laptop.  In a directory
containing 501 spams (between 24 and 3080 lines each, average 142 lines) I
executed:

    for f in `find . -type f` ; do
        time sb_bnfilter.py < "$f" > /dev/null
    done 2>&1 | grep real | sed -e 's/[^0-9.]//g' > ~/tmp/times.txt

The minimum real time was 0.180 seconds.  The maximum was 1.057 seconds.
The mean time was 0.260 seconds.  
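
A file like times.txt can be reduced to those three figures with a few lines
of Python.  This is only a sketch: it assumes one time per line, in seconds
and under a minute (which is what the sed expression above leaves behind),
and the function name is made up for the example.

```python
import statistics


def summarize_times(lines):
    """Return (min, max, mean) of per-message times, in seconds.

    `lines` is any iterable of strings such as "00.180"; blank lines
    are skipped.
    """
    times = [float(line) for line in lines if line.strip()]
    return min(times), max(times), statistics.mean(times)
```

It would be called along the lines of summarize_times(open("times.txt")),
with the path adjusted to wherever the timings were written.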

I then tried it with a byte-compiled version of sb_bnfilter.py:

    for f in `find . -type f` ; do
        time python ~/local/bin/sb_bnfilter.pyc < "$f" > /dev/null
    done 2>&1 | grep real | sed -e 's/[^0-9.]//g' > ~/tmp/times2.txt

The times improved slightly: min 0.172, max 0.957, mean 0.241.

I then tried a third test, adding -A 1000 to the sb_bnfilter.py command line
in the second test to keep a single sb_bnserver.py running for the entire
test.  Results: min 0.169, max 0.841, mean 0.236.  I'd try the psyco test,
but my laptop is a Mac and psyco only supports x86.

Presumably performance would also improve on a more serious mail server.
What's your target processing time per message?

Skip


