[spambayes-dev] A spectacular false positive
Toby Dickenson
tdickenson at devmail.geminidataloggers.co.uk
Mon Nov 17 04:11:36 EST 2003
On Saturday 15 November 2003 01:02, Tim Peters wrote:
> It had
> virtually no English text, but lots, and lots, and lots of different
> integers (about 100KB worth). There were about a half dozen strong ham
> clues that it had come from him, but about 140 spam clues from the variety
> of little integers, most hapaxes that had appeared in one training spam
> each.
>
> I view that mostly as a danger of mistake-based training: as I've
> mentioned before, mistake-based training tends toward being hapax-driven,
> and hapaxes are brittle. There's nothing *inherently* spammy about, say,
> 16384, and because that's a power of 2 and I'm a computer geek, that
> *would* have appeared in several training ham if I hadn't fallen into
> mistake-based training (yes, 16384 had indeed appeared in one training
> spam).
I occasionally see the inverse problem. I train on every email I receive,
including many hams containing lots of numbers, like the one Jeremy sent you.
Occasionally I get a spam where 2 or 3 numbers (in a price list, usually) are
enough to classify it as ham.
Would you have been as surprised by the same result if Jeremy had sent you a
long list of effectively random words?
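
To make the brittleness of hapaxes concrete, here is a rough sketch of how a
single training occurrence can push a token's score a long way from neutral.
It assumes a Robinson-style adjustment with a prior of 0.5 and a prior
strength of 0.45; the function name, the constants and the training-set sizes
are illustrative assumptions, not taken from the messages above.

```python
def token_spam_prob(spam_count, ham_count, nspam, nham,
                    prior=0.5, strength=0.45):
    """Bayesian-adjusted spam probability for a single token.

    spam_count / ham_count: training messages of each kind containing the token.
    nspam / nham: total training messages of each kind.
    prior / strength: assumed belief about an unseen token and its weight.
    """
    spam_ratio = spam_count / nspam
    ham_ratio = ham_count / nham
    n = spam_count + ham_count
    if n == 0:
        return prior  # never seen: fall back to the prior
    raw = spam_ratio / (spam_ratio + ham_ratio)
    # Pull the raw estimate toward the prior; the pull weakens as n grows.
    return (strength * prior + n * raw) / (strength + n)


# A number seen in exactly one training spam and no ham (a hapax).
hapax_spam = token_spam_prob(1, 0, nspam=100, nham=100)
# The same number if it had also turned up in three training hams.
seen_in_ham_too = token_spam_prob(1, 3, nspam=100, nham=100)
# The inverse case: a number seen in exactly one training ham.
hapax_ham = token_spam_prob(0, 1, nspam=100, nham=100)

print(round(hapax_spam, 3))       # 0.845
print(round(seen_in_ham_too, 3))  # 0.275
print(round(hapax_ham, 3))        # 0.155
```

Under these assumptions one sighting is enough to turn an arbitrary number
into a fairly strong clue in either direction, so a message with many distinct
numbers can be swamped by clues that each rest on a single training message.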
--
Toby Dickenson