[spambayes-dev] A spectacular false positive

Tue Nov 18 10:53:21 EST 2003

[Toby Dickenson]
> I occasionally see the inverse problem. I train on every email I
> receive, including many hams containing lots of numbers like Jeremy
> sent you. Occasionally I get a spam where 2 or 3 numbers (in a price
> list, usually) are enough to classify it as ham.

If you train on everything, and you get substantially more ham than spam,
then your training data is unbalanced in a way that would (I think) push in
that direction.

> Would you have been as surprised by the same result if Jeremy had sent
> you a long list of effectively random words?

Yes, I'd expect that to tend toward unsure, given the way I've trained.

I tried generating a random email like so:

>>> f = file('/updates/word.lst')
>>> d = dict.fromkeys(f)
>>> len(d)
173528
>>> import random
>>> for w in random.sample(d, 300):
...    print w,

and then pasting the result into an email.  word.lst is just a list of
English words, one per line.
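
For anyone repeating this on a current Python, here is a rough Python 3 equivalent of the session above. It is only a sketch: the helper name random_word_body and the tiny in-memory word list are made up for illustration, and in practice you would pass an open word-list file instead.

```python
import random

def random_word_body(words, k, seed=None):
    """Return k distinct randomly chosen words joined by spaces."""
    # Strip trailing newlines and drop duplicates, keeping first-seen order.
    vocab = list(dict.fromkeys(w.strip() for w in words if w.strip()))
    # random.sample needs a sequence in Python 3, so sample from the list.
    rng = random.Random(seed)
    return " ".join(rng.sample(vocab, k))

# A small stand-in for word.lst (one word per line); purely illustrative.
sample_words = ["compiler\n", "initial\n", "zemstvo\n", "burkites\n", "male\n"]
body = random_word_body(sample_words, 3, seed=0)
print(body)
```

With a real word list, something like random_word_body(open(path), 300) would give text to paste into a test message, where path is wherever your list lives.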

That wasn't particularly revealing:  it scored as a low Unsure (22), but
most of the words had never been trained on, so they were simply ignored (for
example, I had never trained on burkites, zemstvo, or morphallaxes before).
The few words that remained were solidly hammy (compiler, initial) or
solidly spammy (male, sexy), about the same number of each.  What pushed it
toward the ham side of unsure were the half-dozen header clues claiming that
the message was sent from me, and to me using my real name.
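
The "simply ignored" behavior can be sketched roughly as below. Everything here is an assumption made for illustration: the function name, the made-up per-word probabilities, the neutral value of 0.5 for never-seen words, and the 0.1 strength threshold are not taken from this message and may not match the real classifier's settings.

```python
UNKNOWN_WORD_PROB = 0.5  # assumed: a never-trained word is treated as neutral
MIN_STRENGTH = 0.1       # assumed: clues closer than this to 0.5 are dropped

def significant_clues(tokens, trained_probs):
    """Return (token, prob) pairs strong enough to affect the score."""
    clues = []
    for tok in set(tokens):
        # Words never trained on fall back to the neutral probability.
        p = trained_probs.get(tok, UNKNOWN_WORD_PROB)
        if abs(p - 0.5) >= MIN_STRENGTH:
            clues.append((tok, p))
    # Hammiest clues first, spammiest last.
    return sorted(clues, key=lambda clue: clue[1])

# Made-up spam probabilities for the few words that had been trained on.
trained = {"compiler": 0.02, "initial": 0.08, "male": 0.95, "sexy": 0.99}
tokens = ["burkites", "zemstvo", "morphallaxes",
          "compiler", "initial", "male", "sexy"]
clues = significant_clues(tokens, trained)
print(clues)
```

Under these assumptions the three untrained words contribute nothing, which is why a mostly unfamiliar vocabulary leaves the outcome to a handful of clues.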

I tried again, boosting the # of random words to 3000, to try to stumble
into more I'd actually trained on.  As expected, that pushed it more toward
exactly Unsure:

Combined Score: 47% (0.465326)
Internal ham score (*H*): 0.571772
Internal spam score (*S*): 0.502424
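
The three numbers above are consistent with folding the two internal scores together as (S - H + 1) / 2. The sketch below assumes that is the combining rule; the function name is made up, and the only evidence here is that it reproduces the figures reported in this message.

```python
def combined_score(ham_score, spam_score):
    """Fold internal ham (*H*) and spam (*S*) scores into one 0..1 score."""
    # Equal H and S give 0.5; H near 1 with S near 0 gives a score near 0.
    return (spam_score - ham_score + 1.0) / 2.0

score = combined_score(0.571772, 0.502424)
print(round(score, 6))     # 0.465326
print(round(score * 100))  # 47
```

Since H and S are nearly equal here, the result lands close to the middle of the range.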

Little integers are different for me, because while they show up in tons of
geek ham, I've trained on very little of that because that kind of stuff
rarely scores above 1, and almost never scores above my ham cutoff of 20.
So mistake-based training almost never trains on geek ham anymore.  My
non-geek friends don't write much about integers <wink>.
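
A minimal sketch of mistake-based training on a 0-100 score scale follows. The ham cutoff of 20 is from the paragraph above; the spam cutoff of 90 and the function name are assumptions for illustration only.

```python
HAM_CUTOFF = 20   # from the message: ham should score at or below this
SPAM_CUTOFF = 90  # assumed value; the message does not state a spam cutoff

def should_train(score, is_spam):
    """Train only on messages the classifier got wrong or was unsure about."""
    if is_spam:
        # Spam that failed to reach the spam cutoff needs training.
        return score < SPAM_CUTOFF
    # Ham that scored above the ham cutoff needs training.
    return score > HAM_CUTOFF

print(should_train(1, is_spam=False))   # geek ham scoring 1: skipped
print(should_train(35, is_spam=False))  # ham that landed in Unsure: trained
```

Under this rule, ham that already scores near 0 never gets trained again, so its tokens stay thinly represented in the training data.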