[Spambayes] The database question that would not die
Richie Hindle
richie@entrian.com
Sun Dec 1 23:49:38 2002
I've tried using bsddb3 on Windows, and the results are encouraging.
Testing with 500 spams, 500 hams and 500 unknowns looks like this:
         Training 1000   Database size   Classifying 500   Database load
Pickle   65 seconds      999,540         35 seconds        4 seconds
bsddb3   82 seconds      1,318,912       43 seconds        (negligible)
Close enough on all counts, I'd say (and the startup time will be a bigger
and bigger win as the database grows). Small savings in time and space for
some operations aren't worth the hassle of having two formats, IMHO.
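
A minimal sketch of why the "Database load" column differs so much. This is not the code that was timed above: it targets the modern Python 3 standard library, `dbm.dumb` is only a stand-in for bsddb3 (it keeps its key index in memory, so it will not reproduce bsddb3's timings), and the token counts and the helper names `lookup_pickle` and `lookup_dbm` are made up for illustration.

```python
import dbm.dumb
import os
import pickle
import tempfile

# Made-up token counts: token -> (spam_count, ham_count). Illustrative only.
counts = {"token%d" % i: (i % 7, i % 3) for i in range(1000)}

workdir = tempfile.mkdtemp()
pickle_path = os.path.join(workdir, "counts.pik")
db_path = os.path.join(workdir, "counts.db")

# Pickle format: the whole dictionary is written as one blob.
with open(pickle_path, "wb") as f:
    pickle.dump(counts, f)

# dbm format: one record per token. Values must be bytes, so pickle each one.
db = dbm.dumb.open(db_path, "c")
for token, pair in counts.items():
    db[token.encode("utf-8")] = pickle.dumps(pair)
db.close()


def lookup_pickle(token):
    # Every token is deserialised before this single lookup can be answered.
    with open(pickle_path, "rb") as f:
        return pickle.load(f)[token]


def lookup_dbm(token):
    # Only the requested record's value is read and unpickled.
    db = dbm.dumb.open(db_path, "r")
    try:
        return pickle.loads(db[token.encode("utf-8")])
    finally:
        db.close()
```

With the pickle, the cost of the first lookup grows with the size of the whole database; with a keyed store, it is paid per record, which is the startup win described above.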
Here's what I did:
 o Installed pybsddb, which gave me the bsddb3 module
 o Created dbhash3.py, a duplicate of dbhash.py (16 lines of code) that
   refers to bsddb3 rather than bsddb
 o Changed anydbm.py to always use dbhash3 on Windows
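
The idea behind those steps (prefer bsddb3, otherwise use whatever else is available) can be sketched as follows. This is not the dbhash3.py or the anydbm.py change described above: `anydbm` and `dbhash` are Python 2 modules, so this sketch assumes Python 3 and falls back to the standard library's `dbm` package instead. The function name `open_token_db` is hypothetical.

```python
import dbm
import os
import tempfile


def open_token_db(path, flag="c"):
    """Open a dict-like on-disk database, preferring bsddb3 when present."""
    try:
        import bsddb3  # third-party pybsddb package; may not be installed
    except ImportError:
        # Fall back to whichever backend the standard library's dbm picks.
        return dbm.open(path, flag)
    # bsddb3.hashopen returns a dict-like object backed by a hash file.
    return bsddb3.hashopen(path, flag)


# Usage: keys and values are bytes so the same code works with either backend.
path = os.path.join(tempfile.mkdtemp(), "tokens.db")
db = open_token_db(path)
db[b"example-token"] = b"42"
db.close()

db = open_token_db(path, "r")
stored = db[b"example-token"]
db.close()
```

Keeping the choice of backend inside one function means callers never need to know which format is on disk, which is the point of dropping the second format.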
I can see a few possible objections:
 o There may be platforms on which anydbm defaults to bsddb 1.85, but for
   which installing bsddb3 is a pain. Any takers?
 o Current pickle users may violently object to the (small?) time and space
   losses incurred by switching to using an anydbm database (which may not
   be bsddb3 on their platform). Any takers?
 o Insisting on bsddb3 prevents closed-source use of the spambayes code
   until Python 2.3 is released. I can't imagine anyone here objecting...?
   I only mention this one for completeness.
 o We should skip bsddb3 and go directly to ZODB. My feeling is that this
   is possibly a good long-term goal, but at this stage it would be
   premature.
 o The dramatic fifth objection, which I haven't thought of but which
   means this idea will never fly. Any takers? 8-)
So now I can ask the question that Neale (I think) asked a while ago - is
there any need to keep the pickle option?
I would LOVE for us to drop the pickle option before I submit my article to
the Linux Journal, which has to happen before Thursday 5th December.
Explaining the different database formats will be an embarrassment - much
better to simply say "Python 2.2 users on Windows also need to download
bsddb3 from <here>".
--
Richie Hindle
richie@entrian.com