[spambayes-dev] RE: [Spambayes] How low can you go?

Tim Peters tim.one at comcast.net
Thu Dec 18 21:16:25 EST 2003


[Ryan Malayter]
>> That's not surprising at all to me. Because of the "birthday
>> paradox", ...

[Seth Goodman]
> As I understand it, the birthday paradox leads to the conclusion that
> for a 32-bit perfect hash function, after hashing around 78,000 items
> (just over 16-bits worth), you are likely to experience a _single_
> collision.  What Tim described sounded like they probably had
> multiple collisions to account for the spectacular failures they saw.
> I don't know the size of the token databases they dealt with back
> then, but I doubt a single collision in a token list of 78K items
> would affect the classifier.  Since most of the tokens are hapaxes
> anyway (perhaps 80-90% ?), it is most probable that there would be no
> visible effect.
> ...

Let me clarify this:  the experiments we ran couldn't actually use a 32-bit
hash code because they used a Python dict to simulate a giant sparse array,
and the box I was using didn't have enough RAM to deal with this load.

Instead we ran with smaller hash codes and smaller training sets, projecting
results.  The results were too discouraging for anyone here to want to
continue along that line.  It's all in the archives if you want to dig back
far enough (I don't <wink>).

With a 32-bit hash code, the expected # of collisions for a truly random
hash is close to 1, with a standard deviation also close to one, at about
92,600 items, so Seth is quite close.  With 350K items (close to the # of
tokens in the pure-unigram database I was actually using at the time), the
mean # of collisions is a bit over 14 with an sdev of about 3.8.  Those
numbers aren't scary, and Python's hash() was indeed behaving as a random
hash would have.  We were considering schemes with much higher
feature-generation rates than pure-unigram at the time, though, so all those
stats don't matter to what we were really wondering about.

BTW, discussions like this really don't belong on the spambayes list.
They're fine spambayes-dev, though, so I've set reply-to to that.  Anyone
who wants to follow that level of tech-talk should subscribe to
spambayes-dev.




More information about the Spambayes mailing list