[spambayes-dev] A spectacular false positive

Toby Dickenson tdickenson at devmail.geminidataloggers.co.uk
Mon Nov 17 04:11:36 EST 2003


On Saturday 15 November 2003 01:02, Tim Peters wrote:

> It had
> virtually no English text, but lots, and lots, and lots of different
> integers (about 100KB worth).  There were about a half dozen strong ham
> clues that it had come from him, but about 140 spam clues from the variety
> of little integers, most hapaxes that had appeared in one training spam
> each.
>
> I view that mostly as a danger of mistake-based training:  as I've
> mentioned before, mistake-based training tends toward being hapax-driven,
> and hapaxes are brittle.  There's nothing *inherently* spammy about, say,
> 16384, and because that's a power of 2 and I'm a computer geek, that
> *would* have appeared in several training ham if I hadn't fallen into
> mistake-based training (yes, 16384 had indeed appeared in one training
> spam).

I occasionally see the inverse problem. I train on every email I receive,
including many hams that contain lots of numbers, like the one Jeremy sent
you. Occasionally I get a spam where 2 or 3 numbers (usually in a price
list) are enough to classify it as ham.

Would you have been as surprised by the same result if Jeremy had sent you a
long list of effectively random words?
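
To make the arithmetic concrete, here is a rough sketch of how both effects
can arise under Robinson-style smoothing and chi-squared combining. It is
not the actual spambayes code: the function names, the smoothing constants
(0.45 and 0.5), the training-set sizes, and the token counts are all
assumptions picked for illustration.

```python
import math

# Assumed smoothing constants; illustrative, not read from any real config.
UNKNOWN_WORD_STRENGTH = 0.45
UNKNOWN_WORD_PROB = 0.5


def smoothed_prob(spam_count, ham_count, nspam, nham):
    """Spam probability of one token, pulled toward 0.5 when evidence is thin."""
    spam_ratio = spam_count / nspam
    ham_ratio = ham_count / nham
    raw = spam_ratio / (spam_ratio + ham_ratio)
    n = spam_count + ham_count
    s, x = UNKNOWN_WORD_STRENGTH, UNKNOWN_WORD_PROB
    return (s * x + n * raw) / (s + n)


def chi2q(x2, dof):
    """Chi-squared survival function for an even number of degrees of freedom."""
    m = x2 / 2.0
    total = term = math.exp(-m)
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def combined_score(probs):
    """Combine per-token probabilities into one score in [0, 1] (1 = spam)."""
    n = len(probs)
    spam_ev = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    ham_ev = 1.0 - chi2q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    return (spam_ev - ham_ev + 1.0) / 2.0


# Hypothetical training set: 1000 spam and 1000 ham messages.
strong_ham = smoothed_prob(0, 50, 1000, 1000)   # token seen in 50 hams only
spam_hapax = smoothed_prob(1, 0, 1000, 1000)    # integer seen in 1 spam only
spam_clue = smoothed_prob(5, 0, 1000, 1000)     # token seen in 5 spams only

# Tim's case: half a dozen strong ham clues against 140 spam hapaxes.
print(combined_score([strong_ham] * 6))
print(combined_score([strong_ham] * 6 + [spam_hapax] * 140))

# The inverse case: a few numbers common in my ham, appearing in a spam.
print(combined_score([spam_clue] * 3))
print(combined_score([spam_clue] * 3 + [strong_ham] * 3))
```

With these made-up counts, each hapax is individually a weak clue, but 140
of them pointing the same way swamp the six strong ham clues. In the other
direction, three ham-trained numbers pull a short spam's score down
noticeably.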

-- 
Toby Dickenson



