[spambayes-dev] comment assertion error? revisit DBDictClassifier assumptions?

Skip Montanaro skip at pobox.com
Tue Dec 23 15:26:06 EST 2003


    Tim> On Tue, 23 Dec 2003 09:43:49 -0600, Skip Montanaro <skip at pobox.com> wrote:
    >> Perhaps we should reexamine the caching in DBDictClassifier.

    Tim> I have no idea where that comment came from... 

That much I can tell you.  Mark wrote the comment on May 30th.  Here's the
checkin comment:

    2 changes to the way the DB classifier manages words:

    * As per Tim P's mail, keep a list of "changed words" with a flag
    indicating "change" or "delete".  This prevents the database save
    from updating every single word ever loaded by the db.

    * From Sean, a change that prevents caching of hapaxes.  Such words are
    saved directly to the DB.  This reduces the memory footprint significantly
    (as these words are not kept in memory) and helps save times.

    This change makes "incremental" saving of the database happen in a
    reasonable time, and doesn't degrade after a complete retrain etc.

    I'm off for a weekend holiday - someone can just back this out if I
    screwed it up <wink>

Perhaps Mark can elaborate when he returns from holiday.
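
For readers without the source handy, the two changes in that checkin
comment can be sketched roughly as follows.  This is only an illustration
under stated assumptions, not Mark's actual code: the class name, the flag
values, the (spamcount, hamcount) record layout and the is_hapax() helper
are all hypothetical, and a plain dict stands in for the database.

```python
WORD_CHANGED = "C"   # hypothetical flag values, not taken from the real code
WORD_DELETED = "D"


def is_hapax(record):
    """Assumed definition: a word seen at most once across spam and ham."""
    spamcount, hamcount = record
    return spamcount + hamcount <= 1


class ChangeTrackingStore:
    """Hypothetical sketch of a cache that tracks changes and skips hapaxes."""

    def __init__(self, db):
        self.db = db          # dict-like persistent store
        self.cache = {}       # non-hapax words kept in memory
        self.changed = {}     # word -> WORD_CHANGED or WORD_DELETED

    def get(self, word):
        if word in self.cache:
            return self.cache[word]
        if self.changed.get(word) == WORD_DELETED:
            return None       # deleted in memory, not yet flushed
        record = self.db.get(word)
        if record is not None and not is_hapax(record):
            self.cache[word] = record
        return record

    def set(self, word, record):
        if is_hapax(record):
            # Hapaxes go straight to the database and are never cached.
            self.db[word] = record
            self.cache.pop(word, None)
            self.changed.pop(word, None)
        else:
            self.cache[word] = record
            self.changed[word] = WORD_CHANGED

    def delete(self, word):
        self.cache.pop(word, None)
        self.changed[word] = WORD_DELETED

    def save(self):
        # Only touch words that were changed or deleted since the last save,
        # rather than rewriting every word ever loaded.
        for word, flag in self.changed.items():
            if flag == WORD_CHANGED:
                self.db[word] = self.cache[word]
            elif word in self.db:
                del self.db[word]
        self.changed.clear()
```

The point of the flag dict is that save() costs time proportional to the
number of modified words, not to the size of the cache.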

If we are going to cache lookups in the file-based classifiers, I'd prefer
to restructure things so we can reuse behavior defined in
classifier.Classifier wherever possible.  That means that self.wordinfo
should refer to the real file storage, not a cache.  _wordinfoget() and
friends can then rely on the versions in classifier.Classifier and front that
functionality with caches or apply other annotations.  This all breaks
down when you consider the SQL-based classifiers, but they've only ever been
experimental (I think - is anyone using them on a regular basis?), so I
think it's okay for the maintenance burden to be higher for them.
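
A minimal sketch of the restructuring I have in mind, assuming the base
class methods do nothing more than read and write self.wordinfo.  The
Classifier below is a stand-in written for this example, not the real
classifier.Classifier, and CachedFileClassifier and its _cache attribute
are hypothetical names:

```python
class Classifier:
    """Stand-in base class: every access goes through self.wordinfo."""

    def __init__(self):
        self.wordinfo = {}

    def _wordinfoget(self, word):
        return self.wordinfo.get(word)

    def _wordinfoset(self, word, record):
        self.wordinfo[word] = record

    def _wordinfodel(self, word):
        del self.wordinfo[word]


class CachedFileClassifier(Classifier):
    """Hypothetical subclass: wordinfo is the real store, a cache fronts it."""

    def __init__(self, storage):
        Classifier.__init__(self)
        self.wordinfo = storage   # dict-like file storage, not a cache
        self._cache = {}          # purely an optimization layer

    def _wordinfoget(self, word):
        try:
            return self._cache[word]
        except KeyError:
            # Cache miss: fall back to the inherited behavior.
            record = Classifier._wordinfoget(self, word)
            if record is not None:
                self._cache[word] = record
            return record

    def _wordinfoset(self, word, record):
        # Write through to the real storage, then refresh the cache.
        Classifier._wordinfoset(self, word, record)
        self._cache[word] = record

    def _wordinfodel(self, word):
        Classifier._wordinfodel(self, word)
        self._cache.pop(word, None)
```

With this layout the subclass only adds the caching policy; the storage
semantics stay in one place, and anything else in the base class that
touches self.wordinfo directly still sees the real data.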

Skip


