[Spambayes] Corrupt database

Tim Peters tim.one at comcast.net
Fri Jan 30 22:49:25 EST 2004


[Rhesa Rozendaal]
>> I've switched to using a pickle in the meantime, and I must
>> say that I do not really notice any changes in speed. Is there
>> a good reason why the pickle isn't the default? That would seem more
>> user-friendly to me.

[Tony Meyer]
> Well, bsddb should be the better option for the sort of db use that
> SpamBayes needs.  I believe that speed changes could be quite
> noticeable in certain situations (slower machines, larger databases,
> and so on).  OTOH, that is one of the options that the developers have
> discussed.  We'd really like to get this solved, though :)

A pickled dict generally runs much faster than a database for heavy scoring
and heavy batch training.  That's because the dict is entirely in memory.

But for the same reason, initial startup time for a pickled dict is much
longer (the entire dict has to be read from disk and loaded into memory);
more memory is required during operation (the entire dict remains in memory
forever); and saving the pickled dict again is much slower (the entire dict
has to be converted to a giant string and written to disk in one gulp; a
database allows for updating individual token statistics).
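
Here is a minimal sketch of that difference in present-day Python.  The
stdlib shelve module stands in for bsddb (an assumption made for
illustration; it is not the SpamBayes storage code), and the function names
and token counts are made up.

```python
import os
import pickle
import shelve
import tempfile


def save_pickled(counts, path):
    # The entire dict is serialized and written out in one gulp.
    with open(path, "wb") as f:
        pickle.dump(counts, f, pickle.HIGHEST_PROTOCOL)


def load_pickled(path):
    # The entire dict is read from disk into memory at startup.
    with open(path, "rb") as f:
        return pickle.load(f)


def bump_in_db(db_path, token, spam=0, ham=0):
    # Only the record for this one token is read and rewritten.
    with shelve.open(db_path) as db:
        s, h = db.get(token, (0, 0))
        db[token] = (s + spam, h + ham)


def demo():
    workdir = tempfile.mkdtemp()
    pik = os.path.join(workdir, "tokens.pik")
    dbp = os.path.join(workdir, "tokens.db")

    # Hypothetical (spam count, ham count) pairs per token.
    counts = {"viagra": (10, 0), "python": (0, 12)}

    # Pickle route: load everything, change one entry, save everything.
    save_pickled(counts, pik)
    loaded = load_pickled(pik)
    s, h = loaded["viagra"]
    loaded["viagra"] = (s + 1, h)
    save_pickled(loaded, pik)

    # Database route: the same change touches a single record on disk.
    with shelve.open(dbp) as db:
        db.update(counts)
    bump_in_db(dbp, "viagra", spam=1)
    with shelve.open(dbp) as db:
        from_db = dict(db)

    return load_pickled(pik), from_db
```

Both routes end up with the same counts.  What differs is how much gets
read and written to make one small change, and that gap grows with the
size of the database.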

So people running a high-volume filter daemon on a server-class machine
would be better off with a pickled dict (they don't care about startup or
shutdown time, have plenty of RAM, and scoring speed matters in high-volume
applications).  People who fire up spambayes frequently probably couldn't
tolerate the slow startup time of a pickled dict, and people low on RAM
couldn't afford the memory hit.

More RAM and faster CPU are recommended for all purposes <wink>.



