[spambayes-dev] improving dumbdbm's survival chances...

Tim Peters tim.one at comcast.net
Tue Jul 15 11:54:51 EDT 2003

[G. Armour Van Horn]
> Hey, I'm nowhere near "Tim's sister" capability, but I still want to
> just download the zip, extract it, and run the proxy. If dumbdbm is a
> dumb way to go, it shouldn't be the default.

It's a last resort, but it shouldn't be even that.

> I wouldn't be too upset to be retraining, I've only been running this
> install for a week and could just start from scratch again. I was
> planning on keeping up the training for a week or so anyway, although
> my database is already up to 27 megs.

The large size of your database is (just) one of the bad consequences of
using dumbdbm.  A dumbdbm database consists of a .dir file and a .dat file,
and I assume your 27MB refers to the .dat file.  The .dir file holds keys
and the .dat file values.  A dumbdbm .dat file consumes at least 512 bytes
for each value, so a 27MB .dat file can't represent more than about 50,000
tokens -- which is actually on the small end for a spambayes database.
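
As a rough check on that bound, here is a minimal sketch of the arithmetic. It assumes only the 512-byte-per-value floor stated above; the function name and the two readings of "27MB" are illustrative and not part of spambayes or dumbdbm.

```python
BLOCK_SIZE = 512  # minimum bytes consumed per value in the .dat file (from the text above)

def max_values(dat_bytes, block_size=BLOCK_SIZE):
    """Upper bound on how many values fit in a .dat file of dat_bytes bytes."""
    return dat_bytes // block_size

print(max_values(27 * 1000 * 1000))  # "27MB" read as 27 * 10**6 bytes -> 52734
print(max_values(27 * 1024 * 1024))  # "27MB" read as 27 * 2**20 bytes -> 55296
```

Either reading lands in the low 50,000s, which is where "about 50,000" comes from.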

That's outrageous overhead, since there's only about 4 bytes of information
in a spambayes database value, 1/128 of the 512 bytes dumbdbm requires.  Only
a custom-designed database could actually get down to 4 bytes, but a good
general-purpose database should be able to get away with much less than 512
bytes per spambayes database value.
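
One way to see the per-value cost for yourself is to store a batch of tiny values and measure the .dat file. The sketch below makes several assumptions: it uses Python 3's `dbm.dumb` (the renamed successor of the `dumbdbm` module discussed here), the key names and the 4-byte payload are made up for illustration, and the block padding is an implementation detail that could differ between Python versions.

```python
import dbm.dumb
import os
import tempfile

def dat_bytes_per_value(n_values=1000, payload=b"\x00" * 4):
    """Store n_values small values; return .dat file size divided by n_values."""
    with tempfile.TemporaryDirectory() as tmp:
        base = os.path.join(tmp, "demo")  # hypothetical database name
        db = dbm.dumb.open(base, "c")     # creates demo.dat and demo.dir
        for i in range(n_values):
            # Illustrative keys; a real spambayes database uses tokens as keys.
            db[("token%d" % i).encode("ascii")] = payload
        db.close()
        return os.path.getsize(base + ".dat") / n_values

per_value = dat_bytes_per_value()
print("bytes per value in .dat:", per_value)
print("overhead factor vs. 4-byte payload:", per_value / 4)
```

If the implementation pads each value out to a 512-byte boundary, the first number should come out close to 512 and the second close to 128.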

