[spambayes-dev] A spectacular false positive

Tim Peters tim.one at comcast.net
Fri Nov 14 20:02:45 EST 2003


Jeremy (Hylton) sent me some work-related email today, the output from
running a statistics-gathering program over a ZODB database.  We both
wondered why I hadn't gotten the message, and I eventually discovered that
it was actually in my Spam folder, and at "the wrong end" to boot (the view
on my Spam folder is sorted by spam score).  It had an internal ham score of
exactly 0 and an internal spam score of exactly 1.

So I trained on it as ham, and the next time he sent a similar report,
things were reversed:  the new one got ham=1 and spam=0.

So what unforgivable sin had he committed in the first email?  Heh.  It had
virtually no English text, but lots, and lots, and lots of different
integers (about 100KB worth).  There were about a half dozen strong ham
clues that it had come from him, but about 140 spam clues from the variety
of little integers, most of them hapaxes (each had appeared in exactly one
training spam and in no training ham).
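
To get a rough feel for how a half dozen strong ham clues can be swamped by
about 140 weak spam clues, here is a minimal sketch of Robinson-style word
probabilities fed into chi-squared combining.  It is an illustration under
stated assumptions, not the actual classifier code:  the prior strength
(0.45), the unknown-word prior (0.5), the training-set sizes, and the count
of 10 ham sightings per "strong ham clue" are all assumed values, and the
function names are made up for this example.

```python
import math

S_STRENGTH = 0.45  # assumed strength given to the prior
X_UNKNOWN = 0.5    # assumed probability for a never-seen word


def word_prob(spam_count, ham_count, nspam=1000, nham=1000):
    """Robinson-adjusted spam probability for one token.

    nspam/nham are hypothetical training-set sizes.
    """
    spam_ratio = spam_count / nspam
    ham_ratio = ham_count / nham
    p = spam_ratio / (spam_ratio + ham_ratio)
    n = spam_count + ham_count
    return (S_STRENGTH * X_UNKNOWN + n * p) / (S_STRENGTH + n)


def chi2q(x2, dof):
    """Survival function of chi-squared for an even number of
    degrees of freedom, clamped to at most 1.0."""
    m = x2 / 2.0
    total = term = math.exp(-m)
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def combine(probs):
    """Return (ham_score, spam_score, final_score) for the clue list."""
    n = len(probs)
    ln_ham = sum(math.log(1.0 - p) for p in probs)
    ln_spam = sum(math.log(p) for p in probs)
    spam_score = 1.0 - chi2q(-2.0 * ln_ham, 2 * n)
    ham_score = 1.0 - chi2q(-2.0 * ln_spam, 2 * n)
    return ham_score, spam_score, (spam_score - ham_score + 1.0) / 2.0


# 140 integers each seen in exactly one training spam, plus 6 clues
# each seen in 10 training ham and no spam (both counts are assumptions).
clues = [word_prob(1, 0)] * 140 + [word_prob(0, 10)] * 6
ham_score, spam_score, score = combine(clues)
print(ham_score, spam_score, score)
```

Under these assumptions each spam hapax is only a mild clue (roughly 0.84),
but 140 of them together push the ham score to essentially 0 and the spam
score to essentially 1, which is the pattern described above.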

I view that mostly as a danger of mistake-based training:  as I've mentioned
before, mistake-based training tends toward being hapax-driven, and hapaxes
are brittle.  There's nothing *inherently* spammy about, say, 16384, and
because that's a power of 2 and I'm a computer geek, that *would* have
appeared in several training ham if I hadn't fallen into mistake-based
training (yes, 16384 had indeed appeared in one training spam).
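
The brittleness of a hapax can be seen in a small sketch.  It assumes a
Robinson-style adjusted probability with prior strength 0.45 and unknown-word
prior 0.5, plus equally sized spam and ham training sets; the function name
and the count of 5 ham sightings are hypothetical, chosen only to stand in
for "several training ham".

```python
def word_prob(spam_count, ham_count, s=0.45, x=0.5):
    """Adjusted spam probability for a token, assuming equally sized
    spam and ham training sets (so raw counts can be compared directly)."""
    n = spam_count + ham_count
    p = spam_count / n
    return (s * x + n * p) / (s + n)


# A token such as 16384 seen in one training spam and nothing else:
hapax = word_prob(1, 0)

# The same token if it had also shown up in, say, 5 training ham:
seen_in_ham_too = word_prob(1, 5)

print(round(hapax, 3), round(seen_in_ham_too, 3))
```

With those assumed numbers the lone spam sighting makes the token lean
spammy, while a handful of ham sightings would flip it to a ham clue.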

So it's a cute one.  I have to note that it argues in favor of a whitelist
gimmick too -- although that wouldn't have done me any good since I never
would have anticipated that anything Jeremy sent would get scored as spam.
Even if I had anticipated it, I don't remember all the email accounts he
uses, and probably wouldn't have thought to whitelist the account he used to
send this one.

So if any spammers are reading this, here's how to get by my mistake-based
filter now:  add scads of random little integers to your spam.  If the rest
of your spam is brief enough, it will get a spam score of 0, because now my
database has even more little integer hapaxes in the *ham* direction.

amusedly y'rs  - tim



