[spambayes-dev] Deprecated options

Ryan Malayter rmalayter at bai.org
Fri Aug 6 19:11:59 CEST 2004


[Tim Peters]
>In this thread, which you started <wink>, Bill said 
>"effectively 64 bits": 
>http://mail.python.org/pipermail/spambayes/2003-September/007602.html

Wow, I totally forgot about that thread. Now I realize why I was so
intrigued by this new bigrams/DB size thread ;-).

>Details are important at this level.

I agree. From the description on the CRM114 site of how hashes are
mapped to 1-MB .CSS files, I figured the effective hash length was 20
bits. I missed a step, apparently, in that CRM114 uses the address
mapping only for the starting address of a hash bucket, not as a direct
clipping of the hash value. I was just plain wrong.
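
To make the distinction concrete, here is a rough Python sketch of the
two readings. It is only an illustration of the idea as described above,
not CRM114's actual code: the names (NUM_SLOTS, clipped_key,
BucketTable), the linear probing, and the dict-backed storage are all
assumptions made up for this example.

```python
NUM_SLOTS = 1 << 20  # hypothetical table size: 2**20 slots


def clipped_key(h):
    """Direct clipping: keep only the low 20 bits of the hash.

    Two different hashes with the same low 20 bits become the same key.
    """
    return h & (NUM_SLOTS - 1)


class BucketTable:
    """The hash only picks the *starting* slot; the full hash is stored
    in the slot and compared on lookup, so no hash bits are thrown away.
    """

    def __init__(self, num_slots=NUM_SLOTS):
        self.num_slots = num_slots
        # slot index -> [full_hash, count]; a dict keeps the sketch small
        self.slots = {}

    def _probe(self, h):
        """Yield slot indexes starting at h % num_slots, wrapping around."""
        start = h % self.num_slots
        for step in range(self.num_slots):
            yield (start + step) % self.num_slots

    def increment(self, h):
        for i in self._probe(h):
            entry = self.slots.get(i)
            if entry is None:          # empty slot: claim it
                self.slots[i] = [h, 1]
                return
            if entry[0] == h:          # same full hash: bump the count
                entry[1] += 1
                return
        raise RuntimeError("table full")

    def count(self, h):
        for i in self._probe(h):
            entry = self.slots.get(i)
            if entry is None:          # hit an empty slot: never stored
                return 0
            if entry[0] == h:
                return entry[1]
        return 0
```

With two hashes that share their low 20 bits, clipped_key() returns the
same key for both, while BucketTable keeps separate counts because it
compares the full stored hash.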

>Experiments were run on that in SB before, and you can even find
>patches (probably out of date now!) in the archives that implement it.

I remember looking then, and I am still unable to find those patches (in
CVS) or the statistical results. Only anecdotal references to "hashing
performing poorly" seem to appear throughout a bunch of threads. My
Google search was "CRM-114 site:python.org"; I looked at all 93 results,
and none of them pointed to the original tests of these ideas.

I guess the question of whether hashing really fails was never settled
in my mind, since it seems to work so well for CRM-114. But SB has been
working "good enough" for me for over a year now, so I never pursued
things further.

>Results were discouraging.  It did learn "faster".  After a moderate
>amount of training data, though, results were worse.  Collisions did
>hurt, and the rare bad classifications as a result of hash aliasing
>were spectacular:  incomprehensibly bad to the human eye.

Did the test just store the hash value as hex/base64/whatever in the
regular SpamBayes DB format? What hash was used? The same "fast hash"
used in CRM114?

Thanks,
	Ryan


