[spambayes-dev] SpamBayes server compliant w/ spamassassin
skip at pobox.com
Sun Apr 25 00:32:32 EDT 2004
jkx> Where is the significant effort?
jkx> I must really be missing something. Have you read the code I provided? It
jkx> just serves as one single server (hammie filter) for a large number of
jkx> users, but each has their own database.
jkx> - one and only one server (not one per user!)
jkx> - every user has their own db
No, I admit I didn't read your code. I read your mail message and must not
have fully understood what you were after. My apologies.
jkx> Do you really want to open one Unix domain socket per user?
Sure, why not? Unix domain sockets are pretty cheap.
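
As a rough illustration of the "one socket per user" idea, here is a minimal
Python sketch that binds one listening Unix domain socket per user in a single
directory. It is not how sb_bnserver.py is structured; the function names and
the "<user>.sock" naming scheme are assumptions made for the example.

```python
import os
import socket


def open_user_sockets(directory, users, backlog=5):
    """Bind one listening Unix domain socket per user under `directory`.

    Returns a dict mapping user name -> listening socket. The
    "<user>.sock" naming scheme is an assumption made for this sketch.
    """
    listeners = {}
    for user in users:
        path = os.path.join(directory, user + ".sock")
        # Remove a stale socket file left over from a previous run.
        if os.path.exists(path):
            os.unlink(path)
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        sock.bind(path)
        sock.listen(backlog)
        listeners[user] = sock
    return listeners


def close_user_sockets(directory, listeners):
    """Close every listener and remove its socket file."""
    for user, sock in listeners.items():
        sock.close()
        os.unlink(os.path.join(directory, user + ".sock"))
```

Each socket costs one file descriptor and one filesystem entry, so the
practical ceiling is mostly the per-process descriptor limit.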
jkx> I usually work with about 50 users right now (and I wrote this
jkx> code to handle ~1000 accounts).
jkx> Another thing: I don't care about a 'general database'. That isn't
jkx> the goal; I want a system manageable for a large number of users.
I don't think a shared database would work except for a very close group of
users (very similar ideas of what constitutes ham and spam). How do your
users train their databases? I presume you are doing all this on your mail
server. Are your users local or remote?
>> Once you have that working, you can rewrite sb_bnfilter.py in C to
>> reduce memory consumption and maybe improve performance a bit.
>> sb_bnserver.py could probably be sped up just by running it with
jkx> psyco has nothing to do with that. The trouble is 'exec a python' at
jkx> each email.
I don't see 'exec a python' as a huge problem. Presumably on a busy server
the python interpreter and all the compiled bytecode will just be sitting in
memory buffers awaiting activation. Lots of systems do the equivalent of
'exec a python' or more on a per message basis. Have you tried it? Was it
jkx> so even if the server fails for some strange reason, mails aren't lost
jkx> (spamc does that perfectly).
I'd rather trust my mail's delivery to procmail. If sb_bn*.py craps out,
procmail is there to recover the message for me. So far that combination
has been very robust. It processes between 2,000 and 3,000 messages daily
(about 70% spam) for me on my laptop without a hiccup. I generally don't
even notice that it's running.
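
The fail-safe behaviour described above (if the filter dies, the original
message is still delivered) can be sketched in Python as follows. This only
illustrates the idea; it is not how procmail or SpamBayes implements it, and
the function name and timeout value are hypothetical.

```python
import subprocess


def filter_or_passthrough(message, command, timeout=30):
    """Pipe `message` (bytes) through `command`; fall back to the original.

    If the filter cannot be started, exits non-zero, times out, or
    produces no output, the unmodified message is returned so that
    nothing is lost.
    """
    try:
        result = subprocess.run(
            command,
            input=message,
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
            check=True,
        )
    except (OSError, subprocess.SubprocessError):
        # Filter missing, crashed, or hung: deliver the message untouched.
        return message
    # An empty result is treated as a failure too.
    return result.stdout or message
```

For example, `filter_or_passthrough(raw_message, ["sb_bnfilter.py"])` would
return the filtered message when the filter succeeds and `raw_message`
otherwise, where `raw_message` is assumed to hold the message as bytes.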
I just ran a quick test of sb_bnfilter.py on my laptop. In a directory
containing 501 spams (between 24 and 3080 lines each, average 142 lines) I ran:
    for f in `find . -type f` ; do
        time sb_bnfilter.py < $f > /dev/null
    done 2>&1 | egrep real | sed -e 's/[^0-9.]//g' > ~/tmp/times.txt
The minimum real time was 0.180 seconds. The maximum was 1.057 seconds.
The mean time was 0.260 seconds.
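
For reference, min/max/mean figures like these can be derived from the timing
file with a few lines of Python. This is a minimal sketch, assuming the file
holds one value in seconds per line and that every run took under a minute
(the sed expression above would otherwise fuse the minutes digit into the
number). The function names are made up for the example.

```python
import statistics


def summarize_times(lines):
    """Return (min, max, mean) for an iterable of lines holding seconds."""
    # float() tolerates surrounding whitespace and leading zeros ("00.180").
    values = [float(line) for line in lines if line.strip()]
    if not values:
        raise ValueError("no timing values found")
    return min(values), max(values), statistics.mean(values)


def summarize_file(path):
    """Summarize a timing file such as ~/tmp/times.txt (path not expanded)."""
    with open(path) as fh:
        return summarize_times(fh)
```

Calling `summarize_file(os.path.expanduser("~/tmp/times.txt"))` after
`import os` would return the three figures as a tuple.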
I then tried it with a byte-compiled version of sb_bnfilter.py:
    for f in `find . -type f` ; do
        time python ~/local/bin/sb_bnfilter.pyc < $f > /dev/null
    done 2>&1 | egrep real | sed -e 's/[^0-9.]//g' > ~/tmp/times2.txt
The times improved slightly: min 0.172, max 0.957, mean 0.241.
I then tried a third test, adding -A 1000 to the sb_bnfilter.py command line
in the second test to keep a single sb_bnserver.py running for the entire
test. Results: min 0.169, max 0.841, mean 0.236. I'd try the psyco test
but my laptop is a Mac.
Presumably performance would also improve on a more serious mail server.
What's your target processing time per message?