[Spambayes] The database question that would not die

Richie Hindle richie@entrian.com
Sun Dec 1 23:49:38 2002


I've tried using bsddb3 on Windows, and the results are encouraging.
Testing with 500 spams, 500 hams and 500 unknowns looks like this:

          Training 1000   Database size (bytes)   Classifying 500   Database load
Pickle    65 seconds      999,540                 35 seconds        4 seconds
bsddb3    82 seconds      1,318,912               43 seconds        (negligible)

Close enough on all counts, I'd say (and bsddb3's negligible load time will
be a bigger and bigger win as the database grows).  Pickle's small savings in
time and space for training and classifying aren't worth the hassle of having
two formats, IMHO.
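
For anyone wondering where the load-time difference comes from: a pickle has
to be read back in full before the first lookup, whereas a dbm-style file is
opened and then read key by key.  Here is a minimal sketch of that difference.
It assumes a current Python 3 and uses the standard `dbm` module as a stand-in
for anydbm/bsddb3; the function names and file names are made up for the
example and are not spambayes code.

```python
import dbm
import os
import pickle
import tempfile


def build_stores(counts, directory):
    """Write the same word counts as a pickle and as a dbm database."""
    pickle_path = os.path.join(directory, "words.pck")
    dbm_path = os.path.join(directory, "words.db")
    with open(pickle_path, "wb") as f:
        pickle.dump(counts, f)
    db = dbm.open(dbm_path, "c")  # "c": create the database if needed
    try:
        for word, n in counts.items():
            # dbm keys and values are bytes, so encode both sides
            db[word.encode("utf-8")] = str(n).encode("ascii")
    finally:
        db.close()
    return pickle_path, dbm_path


def lookup_via_pickle(path, word):
    """Load the whole dictionary, then look up one word."""
    with open(path, "rb") as f:
        counts = pickle.load(f)  # cost grows with the database size
    return counts[word]


def lookup_via_dbm(path, word):
    """Open the database and fetch only the one record asked for."""
    db = dbm.open(path, "r")
    try:
        return int(db[word.encode("utf-8")])
    finally:
        db.close()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        p, q = build_stores({"viagra": 42, "python": 7}, d)
        print(lookup_via_pickle(p, "python"), lookup_via_dbm(q, "viagra"))
```

Which backend `dbm.open` picks depends on the platform, so absolute timings
from this sketch will not match the table above.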

Here's what I did:

 o Installed pybsddb, which gave me the bsddb3 module
 o Created dbhash3.py, a duplicate of dbhash.py (16 lines of code) that
   refers to bsddb3 rather than bsddb
 o Changed anydbm.py to always use dbhash3 on Windows
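
The last step boils down to "try the preferred backend first, fall back to
whatever else imports".  A rough sketch of that selection logic follows,
written for a current Python 3 rather than as the actual anydbm.py patch.
`pick_backend` is a hypothetical helper, `dbhash3` is the module described
above (it will not exist unless you create it), and the fallback names are
the Python 3 `dbm` submodules, which is my assumption about a modern
equivalent.

```python
import importlib


def pick_backend(preferred, fallbacks):
    """Return the first importable module, trying `preferred` first."""
    for name in [preferred] + list(fallbacks):
        try:
            return importlib.import_module(name)
        except ImportError:
            # This backend is not installed here; try the next one.
            continue
    raise ImportError("no database backend available")


# dbm.dumb is pure Python, so the fallback list always ends in something usable.
backend = pick_backend("dbhash3", ["dbm.gnu", "dbm.ndbm", "dbm.dumb"])
```

Putting the preferred name at the front keeps the fallback behaviour for
platforms where bsddb3 is not installed.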

I can see a few possible objections:

 o There may be platforms on which anydbm defaults to bsddb 1.85, but for
   which installing bsddb3 is a pain.  Any takers?

 o Current pickle users may violently object to the (small?) time and space
   losses incurred by switching to using an anydbm database (which may not
   be bsddb3 on their platform).  Any takers?

 o Insisting on bsddb3 prevents closed-source use of the spambayes code
   until Python 2.3 is released.  I can't imagine anyone here objecting...?
   I only mention this one for completeness.

 o We should skip bsddb3 and go directly to ZODB.  My feeling is that this
   is possibly a good long-term goal, but at this stage it would be
   premature.

 o The dramatic fifth objection, which I haven't thought of but which
   means this idea will never fly.  Any takers?  8-)

So now I can ask the question that Neale (I think) asked a while ago - is
there any need to keep the pickle option?

I would LOVE for us to drop the pickle option before I submit my article to
the Linux Journal, which has to happen before Thursday 5th December.
Explaining the different database formats will be an embarrassment - much
better to simply say "Python 2.2 users on Windows also need to download
bsddb3 from <here>".

-- 
Richie Hindle
richie@entrian.com



