[spambayes-dev] comment assertion error? revisit DBDictClassifier assumptions?
Kenny Pitt
kennypitt at hotmail.com
Tue Dec 23 16:04:10 EST 2003
Tim Stone wrote:
> On Tue, 23 Dec 2003 09:43:49 -0600, Skip Montanaro <skip at pobox.com>
> wrote:
>
>> Perhaps we should reexamine the caching in DBDictClassifier. I
>> would like it to be able to inherit a bit more functionality from
>> its base class. If the assumptions it makes aren't entirely
>> accurate, much of the extra work maintaining caches might be avoided.
>
> I have no idea where that comment came from... The scheme seems bogus
> to me. It's a word, it occurs once or many times, there's no reason
> to treat it differently. If we have memory consumption problems,
> then that's the problem to fix. We've had a bunch of discussion
> about using other db systems (zodb, mysql, etc.). Perhaps this is
> yet another reason to "modernize" our database.
The comment appears in the _wordinfoset() function, which means it is
called when a message is trained. I believe the original reasoning was
probably that there are a lot of tokens in a newly trained message that
have never been seen before, and quite likely will never be seen again.
It would be a waste of memory to cache lots of singleton tokens that
will never be used to classify another message, so the token is saved to
the database on disk but is discarded from the memory cache. If the
token is ever needed when classifying a message in the future, then it
will be read in from the database and will then be kept in the memory
cache.
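To make the policy concrete, here's a minimal sketch of that singleton-discard scheme. The class and attribute names (CachingStore, WordInfo, db, cache) are illustrative, not the actual DBDictClassifier code; the real implementation sits in _wordinfoset() on top of a bsddb table.

```python
class WordInfo:
    """Hypothetical stand-in for the per-token record."""
    def __init__(self, spamcount=0, hamcount=0):
        self.spamcount = spamcount
        self.hamcount = hamcount

class CachingStore:
    """Sketch of the singleton-discard policy: tokens trained once are
    written to the backing store but dropped from the in-memory cache;
    a token is cached only once it is read back during classification."""
    def __init__(self):
        self.db = {}      # stands in for the on-disk database
        self.cache = {}   # in-memory word cache

    def set(self, word, record):
        # Always persist the record to "disk".
        self.db[word] = record
        if record.spamcount + record.hamcount <= 1:
            # Singleton: don't keep it in memory; it may never
            # appear in another message.
            self.cache.pop(word, None)
        else:
            self.cache[word] = record

    def get(self, word):
        if word in self.cache:
            return self.cache[word]
        record = self.db.get(word)
        if record is not None:
            # First read during classification: now cache it.
            self.cache[word] = record
        return record
```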
Because the uni/bigram scheme generates so many more tokens from the
same message, I would think this reasoning applies even more strongly
there.
This same caching scheme could be applied to any of the random-access
database storage mechanisms, such as MySQL or Postgres. It doesn't seem
like it would apply to pickles, however, because the complete list of
all known tokens is always kept in memory for a pickle. Since
PickledClassifier also derives from Classifier, I would have to vote
against moving caching logic into the base Classifier class. Maybe a
DBClassifierBase class derived from Classifier and containing the
caching logic for all database storage mechanisms would be in order.
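The hierarchy I have in mind would look roughly like this. Classifier and PickledClassifier are real names from the thread; DBClassifierBase is the hypothetical intermediate class, and the method bodies are sketches, not the actual implementations.

```python
class Classifier:
    """Base classifier: tokenizing/scoring logic lives here (sketch)."""
    def _wordinfoset(self, word, record):
        # Default behavior: store directly, no cache policy.
        self.wordinfo[word] = record

class DBClassifierBase(Classifier):
    """Hypothetical intermediate base: cache-management logic shared by
    the random-access backends (bsddb, MySQL, Postgres) would go here,
    keeping it out of the base class."""
    def __init__(self):
        self.wordinfo = {}   # stands in for the backing database
        self.cache = {}
    def _wordinfoset(self, word, record):
        # The singleton-discard caching policy would live here.
        self.wordinfo[word] = record

class PickledClassifier(Classifier):
    """Pickle backend: the whole token map is already in memory, so it
    inherits the plain behavior and never sees the DB caching logic."""
    def __init__(self):
        self.wordinfo = {}
```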
Regarding the reduced store time, this "optimization" seems to be
oriented towards a train-on-everything strategy and a long running
application such as sb_server. Keeping updates in memory means that the
counts for a token can be updated multiple times with only one database
write at the end, while writing out singletons immediately keeps the
size of the change list down so that the database update doesn't take
quite so long at shutdown.
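A toy illustration of that deferred-write scheme, with invented names (DeferredWriter, changed standing in for something like self.changed_words): a token trained several times costs a single database write at flush time instead of one write per update.

```python
class DeferredWriter:
    """Sketch: updates accumulate in an in-memory change list and are
    flushed to the database in one pass at shutdown."""
    def __init__(self):
        self.db = {}
        self.db_writes = 0   # count of actual database writes
        self.changed = {}    # pending count updates, keyed by word

    def update(self, word, count=1):
        # No database write yet; just record the pending change.
        self.changed[word] = self.changed.get(word, 0) + count

    def flush(self):
        # One write per changed word, regardless of how many
        # times it was updated in memory.
        for word, count in self.changed.items():
            self.db[word] = self.db.get(word, 0) + count
            self.db_writes += 1
        self.changed.clear()
```

Writing singletons through immediately, as the comment describes, is then just a way of keeping `changed` small so this final flush stays short.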
With the caching and optimization in the database engines being what it
is today, it seems that we might be better off to always write changes
to the DB immediately and dispense with the whole self.changed_words
thing altogether. When there are multiple processes that could be using
the database at the same time, any caching (read or write) that we do
ourselves outside the database engine has the potential to generate
inconsistencies in the data anyway.
Whew, that's a much longer response than I intended. Guess that's what
happens when things get slow before the holidays.
--
Kenny Pitt