[Spambayes] Long load times.

Tony Meyer tameyer at ihug.co.nz
Sat Feb 7 21:05:16 EST 2004

> What determines the size of this file default_bayes_database.db ?

The number of unique tokens that are in all the messages that you have
trained.  In practice, given that messages tend to have the same tokens (or
this whole thing wouldn't work!), the number of messages that you have

> I'm trying to cut down on the long load times I have. Will 
> deleting my spam messages help? Will that decrease the size 
> of the above file?

Having a smaller database file could help, yes.  The way to check this (and
see how much you gain) would be to rename the file and see how fast it loads
then (it'll create a new, empty database).  Afterwards, delete the new
database and put the old one back.
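
If you'd rather not juggle the files by hand, the rename-and-restore step can
be scripted.  This is only a rough sketch, not anything that ships with
SpamBayes: the helper name database_set_aside and the ".bak" suffix are made
up for illustration, and it assumes the database is a single file and that
nothing has it open when the rename happens.

```python
import contextlib
import os


@contextlib.contextmanager
def database_set_aside(db_path):
    """Temporarily rename db_path, then put the original back on exit.

    Hypothetical helper: assumes the database is one ordinary file.
    """
    backup_path = db_path + ".bak"
    if os.path.exists(backup_path):
        # Refuse to clobber an earlier backup.
        raise FileExistsError(backup_path)
    os.rename(db_path, backup_path)
    try:
        yield backup_path
    finally:
        # Throw away whatever new, empty database was created meanwhile,
        # then restore the original under its old name.
        if os.path.exists(db_path):
            os.remove(db_path)
        os.rename(backup_path, db_path)
```

Inside a "with database_set_aside(path_to_your_db):" block you would start the
program, note how long it takes to load, and close it again; the original
database is put back when the block exits.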

How many messages have you trained?  You can get quite good results with
just a couple of hundred of each - if you have several thousand of each,
then an easy way to fix this would be to retrain with a smaller sample.  For
example, I currently have 89 ham and 195 spam trained (note that it would be
better to have roughly equal numbers of ham and spam), and get good results

The 'optimum' size of the ham & spam corpora is something that isn't really
known yet.  If you are more concerned about load times,
then it would certainly be worth giving the 'minimal db' scheme a try.  For
example, only train on mistakes (false positives, false negatives, and
unsures), and see how that goes.
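
To make the 'train on mistakes' idea concrete, here is a rough sketch of the
loop.  It is not SpamBayes code: score and train are hypothetical callables
standing in for whatever classifier you use, and the 0.2 / 0.9 cutoffs are
just illustrative values for the ham/unsure/spam boundaries.

```python
def train_on_mistakes(messages, score, train, ham_cutoff=0.2, spam_cutoff=0.9):
    """Train only on messages the classifier gets wrong or is unsure about.

    messages: iterable of (text, is_spam) pairs with known labels.
    score:    hypothetical callable, text -> spam probability in [0, 1].
    train:    hypothetical callable, (text, is_spam) -> None.
    Returns the number of messages that were trained.
    """
    trained = 0
    for text, is_spam in messages:
        probability = score(text)
        if probability >= spam_cutoff:
            verdict = "spam"
        elif probability <= ham_cutoff:
            verdict = "ham"
        else:
            verdict = "unsure"
        expected = "spam" if is_spam else "ham"
        # False positives, false negatives and unsures all count as mistakes.
        if verdict != expected:
            train(text, is_spam)
            trained += 1
    return trained
```

Messages that are already classified correctly are never trained, so they add
no new tokens, which is what should keep the database small.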

=Tony Meyer

* My database file is 5MB, but I have two experimental options
(x-use_bigrams and x-slurp_urls with x-web_prefix) enabled that make the
database many times bigger than it otherwise would be.  If you only had ~400
messages trained I would expect that the file would be less than half that

