[spambayes-bugs] [ spambayes-Bugs-1053223 ] Blackberry faster than SpamBayes

SourceForge.net noreply at sourceforge.net
Fri Jan 21 04:20:01 CET 2005

Bugs item #1053223, was opened at 2004-10-25 04:06
Message generated for change (Settings changed) made by anadelonbrin
You can respond by visiting: 

Category: Outlook
Group: Binary 1.0
>Status: Pending
Resolution: None
Priority: 5
Submitted By: Benjamin W. Slivka (benslivka)
Assigned to: Nobody/Anonymous (nobody)
Summary: Blackberry faster than SpamBayes

Initial Comment:
I've been running SpamBayes for over a year on Outlook 
(now XP SP3).  I've also had a BlackBerry for three years 
(have had a 7780 since the summer).  Over time, I've 
noticed ever more Spam hitting my BlackBerry, even 
though when I get home and check my Outlook Inbox it 
is not there, but (correctly) in the Spam folder.

And recently I've noticed that when I select an email in 
the "maybe SPAM" folder and click the "Delete as Spam" 
button, it takes several seconds to move it to the SPAM folder.

So I'm guessing SpamBayes has a linear (or worse than 
linear) time algorithm for updating the spam database.

SpamBayes Outlook Addin Binary Version 1.0 (July 2004) 
reports this training database status: Database has 
30326 good and 16916 spam.

I assume one solution is for me to delete the databases 
and retrain.

Have you also considered some kind of automatic 
pruning to keep the databases manageable?

Here are the SB DB file sizes:
     1,316 default_bayes_customize.ini
42,049,536 default_bayes_database.db
 2,506,752 default_message_database.db
     3,521 MS Exchange Settings.ini

Of course the other possibility is to modify your DB 
algorithms so their running time is O(log N) or faster...

Thank you!
--Ben Slivka


Comment By: Tony Meyer (anadelonbrin)
Date: 2004-11-03 17:00

Logged In: YES 

1. Well, you do select the messages that end up in the db,
even though you don't do anything else.  Perhaps adjusting
the thresholds might help?  Or enabling the [Classifier]
x-use_bigrams option?  (Open the default_bayes_customize.ini
file in the data directory, and add the appropriate lines;
to be effective, this does need a retrain).

2.  There aren't any other files, so this is rather
perplexing.  There have been a couple of other reports of
slowdowns on the spambayes at python.org list recently, which
also are unresolved at the moment.  I'll update this once
more is figured out.

3.  Not much, at the moment :(.  Have you seen the material
at <http://entrian.com/sbwiki>?  That's mostly what we have
on training at the moment, or at least a good (lengthy)
summary.  Typically the action is to retrain from scratch. 
We're still looking into it, really.

5. Sorry.  I was under the impression that even with the
binary you could switch to a pickle, but a glance at the
code indicates that I was wrong.
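
As an illustration of the suggestion in point 1 (this is not part of the original comment): the section name [Classifier] and the option name x-use_bigrams come from the text above, while the value "True" and the "name: value" layout are assumptions. The Python sketch below only checks that such a fragment parses with the standard library's configparser; it does not show how SpamBayes itself reads the file.

```python
import configparser

# Assumed fragment to add to default_bayes_customize.ini.
# Section and option names are taken from the comment above;
# the value "True" and the "name: value" layout are assumptions.
FRAGMENT = """\
[Classifier]
x-use_bigrams: True
"""


def bigrams_enabled(ini_text):
    """Return True if the text enables x-use_bigrams under [Classifier]."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    # fallback=False covers a missing section or a missing option.
    return parser.getboolean("Classifier", "x-use_bigrams", fallback=False)


print(bigrams_enabled(FRAGMENT))
```

As the comment says, the option only takes effect after a retrain.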


Comment By: Benjamin W. Slivka (benslivka)
Date: 2004-10-26 14:14

Logged In: YES 

Dear Tony,

Thank you for the thorough response.

1) I didn't create the database -- that is happening "under 
the covers".  I assume SpamBayes built up that database all 
on its own?  I'm just an "end user" and I get a lot of email 
(and, unfortunately, a lot of spam -- ben **at** slivka 
**dot** com is too easy to figure out, I guess).  And there is 
no obvious way in the user interface for me to shrink or prune 
the database.  

2) I deleted the database files -- it didn't change the 
classification speed at all -- it's still noticeably slower than it 
was when I first installed SpamBayes (in 2003).  Are there 
some other files or registry settings I should reset/delete?

3) So you would have to redesign your database schema -- I 
get that.  But what advice do you have for users when the DB 
gets "too large"?

4) I'm all for you all experimenting with new techniques and 
training regimens!

5) I installed the Windows binary -- I'm not looking to 
experiment with other database choices!

Thank you!


Comment By: Tony Meyer (anadelonbrin)
Date: 2004-10-26 14:03

Logged In: YES 

1.  ~47000 messages is a very large database.  Generally, it
seems that the best results can be obtained from quite small
(under 1000) databases, which would remove this problem. 
The wiki has a lot of stuff about training strategies.

2.  Classification time shouldn't be particularly related to
the db size (training certainly is).  I don't know what the
system is that sends mail to the BlackBerry, but perhaps
adjusting the background filtering options could help with
this problem?

3.  There has been a little investigation into expiring
messages, but the research hasn't shown that it's
particularly helpful.  One major problem is that SpamBayes
relies on bags of tokens being added/removed as a set.  This
means that if we were to prune the database we would want to
remove whole messages, not individual tokens.  At the moment
we don't store this information, so it would mean a whole
new database/table.

4.  Alternate training regimes, which keep the database size
small, like 'train to exhaustion', are likely to be the best
solution for this sort of problem.  The 1.1 release of
SpamBayes will almost certainly have some sort of support
for (more easily) trying out different training regimes with
the Outlook plug-in/sb_server.

5.  The database is user-selected.  By default bsddb is used
(I have no idea what the access times for bsddb are meant to
be, but I'm sure Google could pull up something).  You can,
however, use a pickle, MySQL, or PostgreSQL.  Any of these
might help, depending on your exact requirements.  A pickle
takes a lot more memory, and will be slow to load/save, but
very fast otherwise.
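
To make point 3 concrete, here is a minimal Python sketch of why pruning needs per-message records. It is a toy model, not SpamBayes' classifier or database schema: the class TinyTokenStore, its attributes, and the message ids are all hypothetical. The idea it shows is the one described above: each message's bag of tokens is added as a set, so removing a message later means subtracting that same set, which is only possible if the set was stored.

```python
class TinyTokenStore:
    """Hypothetical toy store: per-token counts plus a per-message record."""

    def __init__(self):
        self.nspam = 0
        self.nham = 0
        self.counts = {}    # token -> [spam_count, ham_count]
        self.messages = {}  # msg_id -> (is_spam, frozenset of tokens)

    def train(self, msg_id, tokens, is_spam):
        if msg_id in self.messages:
            raise ValueError("message already trained: %r" % (msg_id,))
        bag = frozenset(tokens)
        # Remember the whole bag so it can be removed as a set later.
        self.messages[msg_id] = (is_spam, bag)
        self._adjust(bag, is_spam, +1)

    def untrain(self, msg_id):
        # Pruning removes a whole message, never individual tokens.
        is_spam, bag = self.messages.pop(msg_id)
        self._adjust(bag, is_spam, -1)

    def _adjust(self, bag, is_spam, delta):
        if is_spam:
            self.nspam += delta
        else:
            self.nham += delta
        column = 0 if is_spam else 1
        for token in bag:
            pair = self.counts.setdefault(token, [0, 0])
            pair[column] += delta
            if pair == [0, 0]:
                del self.counts[token]  # drop tokens no message uses


store = TinyTokenStore()
store.train("m1", ["cheap", "pills"], is_spam=True)
store.train("m2", ["cheap", "flights"], is_spam=False)
store.untrain("m1")  # expire the older message as a unit
print(sorted(store.counts.items()))
```

The extra messages mapping is the "whole new database/table" cost mentioned in point 3: without it, there is no way to know which token counts a given old message contributed to.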

