[Spambayes] Guidance re pickles versus DB for Outlook

Mon Nov 25 23:45:01 2002

Hi everyone (and tim1 <wink>)

  I've been thinking about the "database" to use for the Outlook plugin.  I
see two reasonable choices today: pickles and whatever anydbm picks up on
Windows.

My understanding is that the main trade-offs are that pickles are slow to
load, but lightning fast to use, whereas a database is fast(er) to load, but
slow to use.  IIRC, updating the probabilities was a real killer for a DB,
but this problem has recently gone away.
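
To make the trade-off concrete, here is a rough sketch of the two storage
styles. It is an illustration only: the token table, the file names and the
value layout are made up for the example and are not the real Spambayes
schema, and it uses Python 3's dbm module (the successor to anydbm).

```python
import dbm
import os
import pickle
import tempfile

# Hypothetical token -> (spam_count, ham_count) table, invented for this sketch.
counts = {"viagra": (40, 0), "python": (1, 55), "meeting": (3, 20)}

workdir = tempfile.mkdtemp()
pickle_path = os.path.join(workdir, "bayes.pck")   # assumed file name
db_path = os.path.join(workdir, "bayes.db")        # assumed file name

# Pickle: the whole table is written and read in one go.
with open(pickle_path, "wb") as f:
    pickle.dump(counts, f, protocol=pickle.HIGHEST_PROTOCOL)
with open(pickle_path, "rb") as f:
    loaded = pickle.load(f)        # load cost grows with the size of the table
pickle_lookup = loaded["python"]   # after that, a lookup is a plain dict access

# dbm: opening is cheap, but each lookup reads from the file and decodes a value.
with dbm.open(db_path, "c") as db:
    for token, pair in counts.items():
        db[token.encode("utf-8")] = pickle.dumps(pair)
with dbm.open(db_path, "r") as db:
    db_lookup = pickle.loads(db[b"python"])
```

Both lookups return the same pair; the difference is where the time goes, at
load for the pickle and per access for the DB.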

To be honest, my main motivation in even thinking about this is the terrible
things we are doing to Outlook's startup time.  My decent machine is taking
quite a few seconds longer to get Outlook started - and this cost is worn
every time *any* application uses Outlook for anything at all.  If we do any
sort of training, we also pay this penalty at shutdown, when saving the pickle.
If we crash, we lose all recent training data.

So, I see two basic routes I can take:

* Move to a DB, but stick with a fully synchronous model.  We still wear the
DB load time at startup, but this should be reduced significantly.  We wear
the performance costs at runtime associated with the scoring, and do all
such scoring in the "foreground", saving the DB as necessary.

* Stick with pickles, but move to a threaded asynchronous model.  Messages
can be "queued" for scoring/training.  At startup, we spin a new thread to
load the pickle.  Any "missed" messages at startup, and all messages as they
arrive are queued for scoring and filtering.  If the pickle is loaded, then
it will generally appear synchronous; otherwise, new messages may sit in your
inbox for a few seconds before they are removed.  When the pickle is
modified, a background thread copies the data, and starts writing.  We do
some smarts with renaming the previous versions, as Tim1 suggested.  There
would be support for synchronous calls too (eg, "show spam clues"), but in
general, asynch could be used.
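
The asynchronous route might look roughly like the sketch below. Everything
in it is an assumption made for illustration: the AsyncClassifier class, its
method names, the ".tmp"/".bak" naming and the toy scoring function are all
hypothetical, and none of it is existing Spambayes or Outlook plugin code.

```python
import os
import pickle
import queue
import tempfile
import threading


class AsyncClassifier:
    """Hypothetical wrapper: loads the pickle off-thread, scores queued messages."""

    def __init__(self, path):
        self.path = path
        self.table = None                # set once the pickle has loaded
        self.lock = threading.Lock()     # guards self.table
        self.pending = queue.Queue()     # messages waiting to be scored
        self.results = {}                # message id -> score
        self.worker = threading.Thread(target=self._run, daemon=True)
        self.worker.start()

    def _run(self):
        # Load first; anything queued in the meantime simply waits its turn.
        with open(self.path, "rb") as f:
            table = pickle.load(f)
        with self.lock:
            self.table = table
        while True:
            item = self.pending.get()
            if item is None:             # sentinel: shut the worker down
                self.pending.task_done()
                break
            msg_id, words = item
            with self.lock:
                self.results[msg_id] = self._score(words)
            self.pending.task_done()

    def _score(self, words):
        # Placeholder maths (mean of known word values), not the real classifier.
        known = [self.table[w] for w in words if w in self.table]
        return sum(known) / len(known) if known else 0.5

    def submit(self, msg_id, words):
        self.pending.put((msg_id, words))

    def save(self):
        # Copy under the lock, then write the copy without holding the lock.
        with self.lock:
            if self.table is None:
                return                   # nothing loaded yet, nothing to save
            snapshot = dict(self.table)
        tmp = self.path + ".tmp"
        with open(tmp, "wb") as f:
            pickle.dump(snapshot, f)
        os.replace(self.path, self.path + ".bak")   # keep the previous version
        os.replace(tmp, self.path)

    def close(self):
        self.pending.put(None)
        self.worker.join()


path = os.path.join(tempfile.mkdtemp(), "bayes.pck")
with open(path, "wb") as f:
    pickle.dump({"viagra": 1.0, "python": 0.0}, f)

clf = AsyncClassifier(path)              # returns at once; loading runs in the worker
clf.submit("msg-1", ["viagra", "unknown"])
clf.submit("msg-2", ["python"])
clf.pending.join()                       # a synchronous caller can wait like this
clf.save()
clf.close()
```

Here save() runs in the calling thread to keep the example short; it could
equally be started with threading.Thread(target=clf.save) so that the write
happens in the background. Waiting on the queue is one way a synchronous
call such as "show spam clues" could be layered on top.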

I would appreciate some comments on this.  I am leaning towards the asynch
model, but it is clearly more complicated.  However, if moving to a DB
simply means we will have perf issues, just not at startup, then the
complexity would be warranted.

Any thoughts?  Fairy god-mothers? Magic answers?

Thanks,

Mark.