[Spambayes] Corrupt database
Tim Peters
tim.one at comcast.net
Fri Jan 30 22:49:25 EST 2004
[Rhesa Rozendaal]
>> I've switched to using a pickle in the meantime, and I must
>> say that I do not really notice any changes in speed. Is there
>> a good reason why the pickle isn't the default? That would seem more
>> user-friendly to me.
[Tony Meyer]
> Well, bsddb should be the better option for the sort of db use that
> SpamBayes needs. I believe that speed changes could be quite
> noticeable in certain situations (slower machines, larger databases,
> and so on). OTOH, that is one of the options that the developers have
> discussed. We'd really like to get this solved, though :)
A pickled dict generally runs much faster than a database for heavy scoring
and heavy batch training. That's because the dict is entirely in memory.
But for the same reason, initial startup time for a pickled dict is much
longer (the entire dict has to be read from disk and loaded into memory);
more memory is required during operation (the entire dict remains in memory
forever); and saving the pickled dict again is much slower (the entire dict
has to be converted to a giant string and written to disk in one gulp; a
database allows for updating individual token statistics).
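A rough sketch of that difference, for illustration only. It is not SpamBayes code: the helper names, file names, and the (spam, ham) count tuples are made up, and the standard library's shelve module stands in for bsddb here on the assumption that any keyed on-disk store shows the same per-token update pattern.

```python
import os
import pickle
import shelve
import tempfile


def save_pickled(counts, path):
    # The whole dict is serialized and written out, no matter how few
    # entries changed since the last save.
    with open(path, "wb") as f:
        pickle.dump(counts, f, pickle.HIGHEST_PROTOCOL)


def load_pickled(path):
    # The whole dict is read into memory before a single token can be scored.
    with open(path, "rb") as f:
        return pickle.load(f)


def update_db(path, token, spam, ham):
    # Only the record for this one token is read and rewritten.
    with shelve.open(path) as db:
        old_spam, old_ham = db.get(token, (0, 0))
        db[token] = (old_spam + spam, old_ham + ham)


def lookup_db(path, token):
    # Fetch a single token's counts without loading anything else.
    with shelve.open(path) as db:
        return db.get(token, (0, 0))


workdir = tempfile.mkdtemp()
pickle_path = os.path.join(workdir, "tokens.pck")  # hypothetical file name
db_path = os.path.join(workdir, "tokens.db")       # hypothetical file name

# Pickled dict: load everything, change one entry, write everything back.
save_pickled({"viagra": (10, 0), "python": (0, 25)}, pickle_path)
in_memory = load_pickled(pickle_path)
in_memory["viagra"] = (11, 0)
save_pickled(in_memory, pickle_path)

# Database: each update touches only the one key involved.
update_db(db_path, "viagra", 10, 0)
update_db(db_path, "viagra", 1, 0)
```

Once loaded, lookups against in_memory are plain dict accesses, which is where the scoring speed comes from; the database version pays a disk-backed lookup per token but never has to load or rewrite the rest.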
So people running a high-volume filter daemon on a server-class machine
would be better off with a pickled dict (they don't care about startup or
shutdown time, have plenty of RAM, and scoring speed matters in high-volume
applications). People firing up spambayes often probably couldn't tolerate
the slow startup time of a pickled dict, and people low on RAM couldn't
afford the memory hit.
More RAM and faster CPU are recommended for all purposes <wink>.