[Spambayes] Guidance re pickles versus DB for Outlook
Tim Peters
tim.one@comcast.net
Tue Nov 26 02:52:25 2002
[Mark Hammond]
> I've been thinking about the "database" to use for the Outlook
> plugin. I see two reasonable choices today: pickles and whatever
> anydbm picks up on Windows.
Then I think we're stuck with pickles for now. On Windows, anydbm picks up
the ancient 1.85 bsddb we (PLabs) ship with the Windows installer, and
that's got nasty bugs no matter how you drive it:
http://www.sleepycat.com/historic.html
SourceForge is littered with reports of "mysterious failures" of the bsddb
code on Windows; it just isn't reliable.
ZODB is, and that's what Jeremy is using, while the neil*.py code in the
project is Neil Schemenauer's implementation of a CDB-based approach.
Windows Python 2.3 will ship with a modern bsddb, but that's no help right
now. (BTW, as long as you're sitting idle <wink>, follow the instructions
in CVS PCbuild\readme.txt for building the new bsddb code, and let me know
what you think about the 4 linker warnings we get -- I don't know whether to
be worried or not, and I don't know how to get rid of them either, short of
giving up on the static-link version of the Berkeley code and building &
linking distinct Release and Debug versions of the latter.)
I'm not really worried about the scoring time with a DB -- "a real" DB has
its own caching schemes to speed frequently accessed items, our project
appears to have grown some form of dict-based spamprob cache of its own, and
scoring has always been a minor part of the total time burden anyway.
> ...
> To be honest,
I'm not sure that's allowed ... let me ask ... OK, you're cleared!
> my main motivation in even thinking about this is the terrible
> things we are doing to Outlook's startup time. My decent machine
> is taking quite a few seconds longer to get outlook started - and
> this cost is worn every time *any* application uses Outlook for
> anything at all. If we do any sort of training, we also pay this
> penalty shutting down, saving the pickle. If we crash, we lose all
> recent training data.
Yup, those are all things a real DB avoids.
> So, I see two basic routes I can take:
>
> * Move to a DB, but stick with a fully synchronous model. We
> still wear the DB load time at startup, but this should be reduced
> significantly.
Oh yes.
> We wear the performance costs at runtime associated with the scoring,
Not worried.
> and do all such scoring in the "foreground", and saving of the DB as
> necessary.
The classifier internals have been fiddled (by others -- and thanks!) so
that only words whose counts have changed need to be updated, and updating
100-or-so records after training is cheap.
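To make the "only changed words get written" idea concrete, here is a minimal sketch in modern Python. It is not the project's classifier code: the WordStore class, its method names, and the (spamcount, hamcount) record layout are all hypothetical, and a plain dict stands in for whatever DB ends up being used.

```python
import pickle


class WordStore:
    """Hypothetical word-count store that writes back only changed records."""

    def __init__(self, db):
        self.db = db          # assumed: a mapping from bytes keys to bytes values
        self.cache = {}       # word -> (spamcount, hamcount), loaded on demand
        self.changed = set()  # words touched since the last store()

    def get(self, word):
        if word not in self.cache:
            raw = self.db.get(word.encode("utf-8"))
            self.cache[word] = pickle.loads(raw) if raw is not None else (0, 0)
        return self.cache[word]

    def learn(self, words, is_spam):
        # Count each distinct word once per message and remember it as dirty.
        for word in set(words):
            spam, ham = self.get(word)
            if is_spam:
                spam += 1
            else:
                ham += 1
            self.cache[word] = (spam, ham)
            self.changed.add(word)

    def store(self):
        # Write only the dirty records; return how many were written.
        written = len(self.changed)
        for word in self.changed:
            self.db[word.encode("utf-8")] = pickle.dumps(self.cache[word])
        self.changed.clear()
        return written


db = {}  # stand-in for a real DB handle
store = WordStore(db)
store.learn(["free", "viagra"], is_spam=True)
store.store()  # writes 2 records, however many words the DB holds
```

The cost of store() scales with the number of words in the messages just trained on, not with the size of the whole database, which is why the update stays cheap.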
> * Stick with pickles, but move to a threaded asynchronous model.
> Messages can be "queued" for scoring/training. At startup, we spin
> a new thread to load the pickle. Any "missed" messages at startup,
> and all messages as they arrive are queued for scoring and filtering.
> If the pickle is loaded, then it will generally appear synchronous,
> otherwise new messages may sit in your inbox for a few seconds
> before they are removed. When the pickle is modified, a background
> thread copies the data, and starts writing. We do some smarts with
> renaming the previous versions, as Tim1 implicated. There
> would be support for synchronous calls too (eg, "show spam clues"),
> but in general, asynch could be used.
That should work fine, and I'll sign up for *anything* that gets you to use
the lovely Queue module for real <wink>. Over the long haul I'm not sure it
will fly, because we still have no way to prune the database over time, and
indeed got rid of the WordInfo fields that were intended to make this
possible in an effective way. So the dict keeps growing, and saving it away
keeps taking longer. But spinning that off to a thread should hide the pain
for a long time, and we'll solve the pruning problem in the meantime <heh>.
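A rough sketch of the threaded pickle approach, in modern Python (where the Queue module is spelled queue). This is an illustration under assumptions, not the plugin's code: worker, toy_score, and save_in_background are hypothetical names, the ".tmp"/".bak" file naming is invented, and the scoring function is a placeholder rather than the real classifier.

```python
import os
import pickle
import queue
import threading


def worker(pickle_path, inbox, results, score):
    # Load the pickle on this thread so startup isn't blocked on it.
    with open(pickle_path, "rb") as f:
        wordprobs = pickle.load(f)
    # Messages queued while the load was in progress are handled now.
    while True:
        msg = inbox.get()
        if msg is None:  # sentinel: shut down
            break
        results.put((msg, score(wordprobs, msg)))


def toy_score(wordprobs, msg):
    # Placeholder scoring: average the per-word probabilities.
    probs = [wordprobs.get(w, 0.5) for w in msg.split()]
    return sum(probs) / len(probs) if probs else 0.5


def save_in_background(wordprobs, path):
    # Copy on the caller's thread so later training can't change what we write.
    snapshot = dict(wordprobs)

    def write():
        tmp = path + ".tmp"
        with open(tmp, "wb") as f:
            pickle.dump(snapshot, f, pickle.HIGHEST_PROTOCOL)
        # Keep the previous version around, then move the new file into place.
        if os.path.exists(path):
            os.replace(path, path + ".bak")
        os.replace(tmp, path)

    t = threading.Thread(target=write)
    t.start()
    return t
```

The caller creates two queue.Queue objects, starts worker on a thread, and puts messages on the inbox as they arrive; a synchronous call such as "show spam clues" would have to wait until the load has finished. The dict copy in save_in_background still happens in the foreground, so it grows with the database just as the save does, only much more slowly.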
> ...
> Any thoughts? Fairy god-mothers? Magic answers?
Today it's pickled dicts, ZODB, or roll-our-own on Windows. In 2.3, bsddb
becomes a real possibility.