[spambayes-dev] A spectacular false positive

Tim Peters tim.one at comcast.net
Sat Nov 15 17:02:58 EST 2003


[Richie Hindle]
>>> Perhaps it's an argument for not classifying using hapaxes?  Wait for
>>> any given clue to appear in more than one message before it becomes
>>> valid for classification.  Has anyone tried this?  (And not just for
>>> SpamBayes - Bill?)

[Rob Hooft]
>> Í hävè nöt tríéd ìt, büt Î äm qûìtë sürè ít wöûld pérfòrm wòrsë!

[Richie]
> 8-)
>
> I'm sure it would perform worse in the short term, but as the size of
> the training set increased, I think the performance would pretty much
> catch up while the chance of false positives would remain
> significantly smaller. (I speak with the conviction of someone with
> no evidence and negligible mathematical ability...)

Graham's original scheme ignored tokens that hadn't appeared at least 5
times in training data.  Some of the very earliest experiments played with
that, moving the cutoff both higher and lower.  The evidence was very clear
that a cutoff of 0 worked best -- and not at the noise level most recent
experiments have shown, but in "0 lost 1 tied 9 won" territory.

Part of the "reason" is surely that *every* token first *enters* the
database as a hapax.  When new kinds of fuzzy ham and spam appear, one
example often introduces enough hapaxes so that the next instance of the
same kind of thing is nailed to the correct category just from scoring the
hapaxes in it.
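For concreteness, here's roughly why a single sighting is already a strong
clue.  This is my restatement of the Robinson adjustment the classifier
applies, with what I believe are the default knobs (unknown_word_strength
s=0.45, unknown_word_prob x=0.5) -- treat the exact numbers as assumptions:

    def spamprob(hamcount, spamcount, nham, nspam, s=0.45, x=0.5):
        """Robinson-adjusted P(spam | token); assumes the token was seen."""
        hamratio = hamcount / float(nham)
        spamratio = spamcount / float(nspam)
        raw = spamratio / (hamratio + spamratio)
        n = hamcount + spamcount
        return (s * x + n * raw) / (s + n)

    # A hapax seen once, in spam only, against 500 ham / 500 spam trained:
    print spamprob(0, 1, 500, 500)   # ~0.84, far from the neutral 0.5

So one sighting already pushes a token most of the way toward 0.0 or 1.0,
which is exactly what lets the next instance of a new kind of spam get
nailed.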

I noticed this dramatically during the last major round of worm spew, where
I was getting about 1,000 worm-related turds each day.  As Skip suggested
recently, I trained on only one at a time, and then rescored the morning's
unsures.  Training on 6 total examples turned out to be enough that I never
had to train on another -- and "that worked" almost purely by capturing
different hapaxes unique to about 6 different variations of the worm spew I
was getting.

So hapaxes are (I believe) really the heart of what lets lazy, minimal
mistake-based training work as well as it does.  It will always be brittle,
though.

A scheme I would like to try can't be tried easily anymore because we
removed some of the info it needs from our database:  ignore hapaxes that
haven't been *used* in scoring over the last (say) week.  Spam especially
seems to come in spurts, where I might get 100 copies in a few days of a
spam containing "16384".  That hapax is very valuable in nailing minor
variations of that spam until that spam campaign ends; but after that point,
I probably never use it to score a spam again, yet it stays in the database
forever.  If it stays there long enough, Jeremy is eventually going to use
it too <wink>.

Especially since more & more of us are inclined toward using tiny databases
(compared to what we used to do), making space for a "last used" timestamp
may not be nearly as scary as it used to be.
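For the record, here's roughly all the machinery the scheme would need,
assuming each word record grew a "lastused" field that gets updated
whenever the token actually contributes clues to scoring (the names here
are invented for illustration, not existing database fields):

    import time

    ONE_WEEK = 7 * 24 * 60 * 60  # seconds

    def prune_stale_hapaxes(wordinfo, max_idle=ONE_WEEK, now=None):
        """Drop hapaxes that haven't been used in scoring for max_idle seconds."""
        if now is None:
            now = time.time()
        for token, record in list(wordinfo.items()):
            is_hapax = record.hamcount + record.spamcount == 1
            if is_hapax and now - record.lastused > max_idle:
                del wordinfo[token]

One timestamp per word record is all the extra space it costs, and the
pruning pass only needs to run once in a while (say, at training time).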



