[spambayes-dev] comment assertion error? revisit DBDictClassifier assumptions?
Skip Montanaro
skip at pobox.com
Tue Dec 23 10:43:49 EST 2003
The comment for DBDictClassifier._wordinfoset says:
# "Singleton" words (i.e. words that only have a single instance)
# take up more than 1/2 of the database, but are rarely used
# so we don't put them into the wordinfo cache, but write them
# directly to the database
# If the word occurs again, then it will be brought back in and
# never be a singleton again.
# This seems to reduce the memory footprint of the DBDictClassifier by
# as much as 60%!!! This also has the effect of reducing the time it
# takes to store the database
With the recent testing of bigrams, the clause "but are rarely used" would
seem to be at least partially false. I'm not too concerned about the memory
footprint of the classifier, since I have lots of memory and use
sb_filter.py, not one of the long-running servers or plugins. I also
question the claim that it reduces the database store time. It's probably
true that the time spent at shutdown is shorter, but that time hasn't gone
away: singletons are written directly to the database as they are seen, so
the cost has simply been amortized over the entire runtime of the program.
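
To make the amortization point concrete, here is a rough sketch of the
behaviour the comment describes. It is not the actual DBDictClassifier code:
the class name WriteThroughSketch, the WordInfo helper, the two write
counters, and the plain dict standing in for the on-disk database are all
made up for illustration.

```python
class WordInfo:
    """Hypothetical per-word record: spam and ham counts only."""

    def __init__(self, spamcount=0, hamcount=0):
        self.spamcount = spamcount
        self.hamcount = hamcount

    def is_singleton(self):
        return self.spamcount + self.hamcount <= 1


class WriteThroughSketch:
    """Illustrative stand-in for the caching scheme, not the real class."""

    def __init__(self):
        self.db = {}             # stands in for the on-disk database
        self.cache = {}          # in-memory wordinfo cache
        self.changed = set()     # cached words still to be written
        self.runtime_writes = 0  # writes done while learning
        self.store_writes = 0    # writes deferred until store()

    def _db_write(self, word, info):
        self.db[word] = (info.spamcount, info.hamcount)

    def wordinfoget(self, word):
        if word in self.cache:
            return self.cache[word]
        row = self.db.get(word)
        if row is None:
            return None
        # A word seen again is brought back into the cache.
        info = WordInfo(*row)
        self.cache[word] = info
        return info

    def wordinfoset(self, word, info):
        if info.is_singleton():
            # Singletons bypass the cache and hit the database now.
            self._db_write(word, info)
            self.runtime_writes += 1
        else:
            self.cache[word] = info
            self.changed.add(word)

    def learn(self, words, is_spam):
        for word in words:
            info = self.wordinfoget(word) or WordInfo()
            if is_spam:
                info.spamcount += 1
            else:
                info.hamcount += 1
            self.wordinfoset(word, info)

    def store(self):
        # Only non-singleton words are left to write at shutdown.
        for word in self.changed:
            self._db_write(word, self.cache[word])
            self.store_writes += 1
        self.changed.clear()


c = WriteThroughSketch()
c.learn(["cheap", "pills", "cheap pills"], True)  # three new singletons
c.learn(["cheap", "meeting"], False)              # "cheap" seen again
c.store()
print(c.runtime_writes, c.store_writes)
```

In this toy run most of the writes happen inside learn() rather than
store(), which is the sense in which the shutdown time looks shorter without
the total work necessarily shrinking.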
Perhaps we should reexamine the caching in DBDictClassifier. I would like
it to be able to inherit a bit more functionality from its base class. If
the assumptions it makes aren't entirely accurate, much of the extra work
of maintaining its caches might be avoided.
Skip