[spambayes-dev] comment assertion error? revisit DBDictClassifier assumptions?

Skip Montanaro skip at pobox.com
Tue Dec 23 10:43:49 EST 2003


The comment for DBDictClassifier._wordinfoset says:

    # "Singleton" words (i.e. words that only have a single instance)
    # take up more than 1/2 of the database, but are rarely used
    # so we don't put them into the wordinfo cache, but write them
    # directly to the database
    # If the word occurs again, then it will be brought back in and
    # never be a singleton again.
    # This seems to reduce the memory footprint of the DBDictClassifier by
    # as much as 60%!!!  This also has the effect of reducing the time it
    # takes to store the database

With the recent testing of bigrams, the clause "but are rarely used" would
seem to be at least partially false.  I'm not too concerned about the memory
footprint of the classifier, since I have lots of memory and use
sb_filter.py, not one of the long-running servers or plugins.  I also wonder
about the contention that it reduces the database store time.  It's probably
true that the time spent at shutdown is shorter, but that time hasn't gone
away.  It has just been spread over the entire runtime of the program, one
direct database write per singleton.
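
To make the amortization point concrete, here is a toy sketch of the scheme
the comment describes.  It is not the actual DBDictClassifier code: the class
ToyCachingStore, its methods and the bigrams() helper are all made up for
illustration, and a plain dict stands in for the on-disk database.  It only
counts how many writes happen while the program runs versus at store time.

```python
# Toy model only; hypothetical names, not the real SpamBayes classes.
class ToyCachingStore:
    def __init__(self):
        self.db = {}        # stands in for the on-disk database
        self.cache = {}     # in-memory cache for non-singleton words
        self.db_writes = 0  # direct writes made while the program runs

    def get_count(self, word):
        # The cache wins if the word is there; otherwise ask the database.
        if word in self.cache:
            return self.cache[word]
        return self.db.get(word, 0)

    def set_count(self, word, count):
        if count == 1:
            # Singleton: bypass the cache and write straight to the database.
            self.cache.pop(word, None)
            self.db[word] = count
            self.db_writes += 1
        else:
            # Seen more than once: keep it in the cache until store().
            self.cache[word] = count

    def learn(self, words):
        for word in words:
            self.set_count(word, self.get_count(word) + 1)

    def store(self):
        # Shutdown-time flush; returns how many cached records were written.
        flushed = len(self.cache)
        self.db.update(self.cache)
        return flushed


def bigrams(tokens):
    # Adjacent token pairs, joined with a space.
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]


store = ToyCachingStore()
tokens = "cheap pills cheap pills buy now".split()
store.learn(tokens + bigrams(tokens))

runtime_writes = store.db_writes     # paid for during the run
flushed_at_shutdown = store.store()  # paid for at shutdown
print(runtime_writes, flushed_at_shutdown)
```

In this toy run most of the writes happen during learn() rather than in
store(), and adding bigrams adds more singletons to the mix.  Whether the
real classifier behaves the same way would have to be measured.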

Perhaps we should reexamine the caching in DBDictClassifier.  I would like
it to be able to inherit a bit more functionality from its base class.  If
the assumptions it makes aren't entirely accurate, much of the extra work of
maintaining the caches might be avoided.

Skip


