[spambayes-dev] Deprecated options

Tim Peters tim.peters at gmail.com
Thu Aug 5 20:42:45 CEST 2004


[Ryan Malayter]
> CRM-114 uses 5-grams or even more, but ultimately uses a short hash to
> represent the n-gram strings. This (intentionally?) short hash
> (effectively 20 bits, from what I've read)

In this thread, which you started <wink>, Bill said "effectively 64 bits":

    http://mail.python.org/pipermail/spambayes/2003-September/007602.html

Details are important at this level.

> results in a lot of collisions, which keeps the classifier DB size small.
> Performance doesn't seem to suffer much at all because of these collisions.

Experiments were run on that in SB before, and you can even find
patches (probably out of date now!) in the archives that implement it.

Results were discouraging.  It did learn "faster".  After a moderate
amount of training data, though, results were worse.  Collisions did
hurt, and the rare bad classifications as a result of hash aliasing
were spectacular:  incomprehensibly bad to the human eye.  The scheme
also needs a different database implementation to be practical
(string-keyed mappings are too wasteful when the keys come from a
contiguous range of integers, and using a Python dict to represent the
mapping is then far more wasteful still).
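
To make the storage point concrete, here is a minimal sketch of counts kept in
flat integer arrays indexed by hash bucket, instead of a string-keyed mapping.
Nothing in it comes from SpamBayes or CRM-114: the class name HashedCounts, the
use of zlib.crc32 as the hash, the 20-bit default width, and counting each
bucket at most once per message are all assumptions made for illustration.

```python
from array import array
import zlib


class HashedCounts:
    """Hypothetical store: per-bucket spam/ham counts in flat arrays."""

    def __init__(self, num_bits=20):
        # num_bits is an illustrative choice, not a recommended value.
        self.mask = (1 << num_bits) - 1
        size = 1 << num_bits
        # One unsigned int per bucket; no per-key string or dict entry.
        self.spam = array("I", [0]) * size
        self.ham = array("I", [0]) * size

    def bucket(self, token):
        # crc32 stands in for whatever hash a real implementation would use.
        return zlib.crc32(token.encode("utf-8")) & self.mask

    def learn(self, tokens, is_spam):
        counts = self.spam if is_spam else self.ham
        # Count each bucket once per message, however many tokens land in it.
        for b in {self.bucket(t) for t in tokens}:
            counts[b] += 1

    def counts_for(self, token):
        # Colliding tokens share a bucket, so this may include other tokens' counts.
        b = self.bucket(token)
        return self.spam[b], self.ham[b]


db = HashedCounts()
db.learn(["free", "offer"], is_spam=True)
db.learn(["free", "meeting"], is_spam=False)
print(db.counts_for("free"))
```

The arrays cost a fixed number of bytes per bucket no matter how many distinct
tokens are seen, which is where the size win comes from.  The price is the one
described above: two unrelated tokens that hash to the same bucket become
indistinguishable to the classifier.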

> ...
> I know Bill Y. (CRM-114's creator) used to participate here, perhaps he
> could offer some ideas. To me, using SBPH to generate tokens for
> SpamBayes seems like it would be fairly straightforward. The rest of
> SpamBayes would stay mostly the same.

It's easy to experiment with, but for practical application it needs a
different database approach, to exploit the nature of the keys.
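
For anyone who wants to try the experiment, the following is a rough sketch of
SBPH-style token generation as I understand the general idea: slide a window
over the word stream and, for each window, emit every combination of keeping
or skipping the earlier words while always keeping the newest one.  The
function name sbph_tokens, the "<skip>" placeholder, and the default window of
5 are illustrative assumptions; this is not CRM-114's actual code, which also
hashes the resulting features rather than keeping them as strings.

```python
from itertools import product


def sbph_tokens(words, window=5, skip="<skip>"):
    """Yield SBPH-style features from a list of words (hypothetical helper)."""
    for end in range(len(words)):
        start = max(0, end - window + 1)
        older = words[start:end]      # up to window-1 earlier words
        newest = words[end]           # always kept
        # Every keep/skip pattern over the earlier words: 2**len(older) features.
        for mask in product((True, False), repeat=len(older)):
            kept = [w if keep else skip for w, keep in zip(older, mask)]
            yield " ".join(kept + [newest])


for tok in sbph_tokens(["buy", "cheap", "pills"], window=3):
    print(tok)
```

A full window of 5 words produces 16 features per position, so the token
stream grows quickly, which is another reason the database representation
matters for practical use.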

