[spambayes-dev] comment assertion error? revisit DBDictClassifier assumptions?

Tue Dec 23 21:11:48 EST 2003

[Kenny Pitt]

You're doing an excellent job of channeling Mark, and I have only a little
to add.  From a 5-mile view, we run a memory cache (which happens to be a
Python dict) on top of a disk-based database, in order that the system not
run too slow to bear.  The memory cache is effective at speeding normal
operation; that's why it's there.  It may err on the side of keeping too
much in memory.

> The comment appears in the _wordinfoset() function, which means it is
> called when a message is trained.  I believe the original reasoning
> was probably that there are a lot of tokens in a newly trained
> message that have never been seen before, and quite likely will never
> be seen again. It would be a waste of memory to cache lots of
> singleton tokens that will never be used to classify another message,
> so the token is saved to the database on disk but is discarded from
> the memory cache.  If the token is ever needed when classifying a
> message in the future, then it will be read in from the database and
> will then be kept in the memory cache.

All correct.
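
For concreteness, here is a minimal sketch of the scheme Kenny describes. It is not the actual DBDictClassifier code: the class name `CachedStore`, its methods, and the use of a single integer count per token (instead of a real word-info record) are all assumptions made for illustration, and a plain mapping stands in for the disk database.

```python
class CachedStore:
    """Hypothetical sketch: a dict cache in front of a slower key-value store."""

    def __init__(self, db):
        self.db = db            # mapping-like stand-in for the disk database
        self.cache = {}         # in-memory cache of token counts
        self.changed = set()    # cached tokens whose counts differ from disk

    def lookup(self, word):
        # Classification path: a token read from disk stays in the cache.
        if word in self.cache:
            return self.cache[word]
        count = self.db.get(word)
        if count is not None:
            self.cache[word] = count
        return count

    def store(self, word, count):
        # Training path: a singleton is written to disk at once and dropped
        # from memory; anything else is cached and written out later.
        if count <= 1:
            self.db[word] = count
            self.cache.pop(word, None)
            self.changed.discard(word)
        else:
            self.cache[word] = count
            self.changed.add(word)

    def flush(self):
        # Write each changed token once, however many times it was updated.
        for word in self.changed:
            self.db[word] = self.cache[word]
        self.changed.clear()
```

With this sketch, `store("rare", 1)` leaves nothing in `cache`, while a later `lookup("rare")` pulls the count back from the database and keeps it in memory from then on.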

> Because the uni/bigram scheme generates so many more tokens from the
> same message, I would think this reasoning would apply even more so
> there.

Me too.

> This same caching scheme could be applied to any of the random-access
> database storage mechanisms, such as MySQL or Postgres.

That's right, and if looking up frequently referenced tokens goes faster in a
dict than reading from disk (hint:  it does <wink>), it will help them too.

> It doesn't seem like it would apply to pickles, however, because
> the complete list of all known tokens is always kept in memory for a
> pickle.

Also right.  Skip, what you described before makes me wonder why you'd want
a disk-based database:

    I'm not too concerned about memory footprint of the classifier,
    since I have lots of memory
    ...
    I also wonder about the contention that it reduces the database
    store time.

If you want peak classification and/or training speed, have lots of memory,
and don't care about initialization or finalization time, running a plain
Python dict (stored as a giant binary pickle) is definitely the way to go.
It's much faster, and it was much faster still before we added layers of
indirection to *allow* dict operations to get satisfied by "real" databases
instead.
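
A rough sketch of the "plain dict stored as a giant binary pickle" approach, using only the standard pickle module. The function names, the file name, and the (spamcount, hamcount) tuple used as a record are assumptions for illustration, not the PickledClassifier code.

```python
import os
import pickle
import tempfile


def load_wordinfo(path):
    # Initialization cost: the whole dict is read into memory in one go.
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return {}


def save_wordinfo(wordinfo, path):
    # Finalization cost: the whole dict is written back, changed or not.
    with open(path, "wb") as f:
        pickle.dump(wordinfo, f, protocol=pickle.HIGHEST_PROTOCOL)


def roundtrip_demo():
    # Hypothetical usage: load, update in memory at dict speed, save, reload.
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "wordinfo.pck")
        wordinfo = load_wordinfo(path)   # empty on the first run
        wordinfo["viagra"] = (12, 0)     # assumed (spamcount, hamcount) record
        save_wordinfo(wordinfo, path)
        return load_wordinfo(path)
```

Everything between the load and the save is an ordinary dict operation, which is where the speed comes from; the price is paid at startup and shutdown.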

FWIW, the memory cache may not apply much to ZODB either, since ZODB keeps
accessed Python objects (which is what ZODB stores) in its own memory cache.

> Since PickledClassifier also derives from Classifier, I would have
> to vote against moving caching logic into the base Classifier class.
> Maybe a DBClassifierBase class derived from Classifier and containing
> the caching logic for all database storage mechanisms would be in
> order.

Of course different storage mechanisms may want different caching
strategies.

> Regarding the reduced store time, this "optimization" seems to be
> oriented towards a train-on-everything strategy and a long running
> application such as sb_server.  Keeping updates in memory means that
> the counts for a token can be updated multiple times with only one
> database write at the end, while writing out singletons immediately
> keeps the size of the change list down so that the database update
> doesn't take quite so long at shutdown.

It was really aimed at incremental training.  When you hit, e.g., the
"Delete as Spam" button in the Outlook addin with even just one msg
selected, the Berkeley db on disk is synch'ed after training.  This makes
for a *very* perceptible delay if the cache contains lots of info that
differs from what's on disk.  Startup and shutdown time are also important
in this context, and amortizing those costs has major "perceived usability"
benefits.

If, e.g., you run from a giant pickled Python dict instead, you can expect
to wait several seconds (at best) whenever loading it from, or storing it
to, disk.

> With the caching and optimization in the database engines being what
> it is today, it seems that we might be better off to always write
> changes to the DB immediately and dispense with the whole
> self.changed_words thing altogether.

This should be measured; it's not (or shouldn't be) a religious issue.  I
have no experience with general-purpose database engines that are actually
fast; only some that aren't as slow as others <0.5 wink>.
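
One way such a measurement could be set up is sketched below. It uses the standard library's dbm.dumb purely as a stand-in for a disk database, so whatever timings it produces say nothing about Berkeley DB, MySQL, or Postgres; the function names and the shape of `updates` are assumptions for illustration.

```python
import dbm.dumb
import os
import tempfile
import time


def write_immediately(db, updates):
    # Every update goes to the database as soon as it happens.
    for word, count in updates:
        db[word] = str(count)


def write_batched(db, updates):
    # Updates accumulate in a dict; each changed word is written once at the end.
    pending = {}
    for word, count in updates:
        pending[word] = count
    for word, count in pending.items():
        db[word] = str(count)


def timed_run(strategy, updates):
    # Run one strategy against a fresh database; return (seconds, final contents).
    with tempfile.TemporaryDirectory() as tmp:
        db = dbm.dumb.open(os.path.join(tmp, "words"), "c")
        try:
            start = time.perf_counter()
            strategy(db, updates)
            db.sync()
            elapsed = time.perf_counter() - start
            contents = {key.decode(): int(db[key]) for key in db.keys()}
        finally:
            db.close()
    return elapsed, contents
```

Both strategies must leave the same contents on disk; only the elapsed times are expected to differ, and a realistic test would feed in updates taken from real training runs against the database engine actually in use.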

> When there are multiple processes that could be using the database
> at the same time, any caching (read or write) that we do ourselves
> outside the database engine has the potential to generate
> inconsistencies in the data anyway.

A conclusion there, one way or the other, depends on specific details.
Concurrent read-write access is never simple, and I'm not sure anyone uses
spambayes that way anyway.