[spambayes-dev] Deprecated options
tim.peters at gmail.com
Mon Aug 9 03:29:31 CEST 2004
> I remember looking then, and I am still unable to find those patches (in
> CVS) or the statistical results. Only anecdotal references to "hashing
> performing poorly" seem to appear throughout a bunch of threads. My
> google search was "CRM-114 site:python.org", there were 93 results that
> I looked at, nothing pointing to the original tests of these ideas.
There were many threads that tried hashing for one reason or another.
Sorry, I can't make time to search for them. One experiment clearly
related to CRM-114, with patch, is here:
For whatever reason, pipermail gave the attachment an .exe extension.
Rename it to .txt (or whatever works for you for a patch file).
> I guess the failure of the whole hashing issue was never really settled
> in my mind, since it seems to work so well for CRM-114. But SB has been
> working "good enough" for me for over a year now, so I never pursued
> things further.
The CRM experiment had much more to do with generating huge piles of
highly correlated features than with hashing. CRM-114 does everything
differently, from tokenization through combining rule. The experiment
only changed one thing in SB, and that experiment was such a disaster
there was no incentive to try to figure out if changing N other things
too may have helped.
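
To make the "huge piles of highly correlated features" point concrete, here
is a minimal sketch of sliding-window feature generation loosely in the
style of CRM-114. It is an illustration only, not the code from the patch
mentioned above: the function name window_features, the 5-token window, and
the "<skip>" marker are all assumptions made for this example, and no
hashing step is shown.

```python
from itertools import product

def window_features(tokens, window=5):
    """Yield every skip-pattern over a sliding window that keeps the newest token.

    Hypothetical illustration: names and the window size are assumptions,
    not taken from SpamBayes or from the experimental patch.
    """
    for i in range(len(tokens)):
        start = max(0, i - window + 1)
        older = tokens[start:i]      # up to window-1 preceding tokens
        newest = tokens[i]
        # Each older token is either kept or replaced by a skip marker,
        # so a full window produces 2**(window-1) features per token.
        for mask in product((False, True), repeat=len(older)):
            parts = [tok if keep else "<skip>" for tok, keep in zip(older, mask)]
            parts.append(newest)
            yield " ".join(parts)

tokens = "click here to claim your free prize".split()
features = list(window_features(tokens))
print(len(tokens), "tokens ->", len(features), "features")
```

Every feature generated at one position shares the newest token, and
neighboring positions share most of their window, so the features are far
from independent of each other.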
> Did the test just store the hash value as hex/base64/whatever in the
> regular SpamBayes DB format?
> What hash was used? The same "fast hash" used in CRM114?
Answered in the msg linked to above.