[Spambayes] Outlook plugin - training
Tim Peters
tim.one@comcast.net
Mon Nov 11 01:17:46 2002
[Tim]
>> ... my primary interest here is to see how bad things can get if
>> a user takes mistake-based training to an extreme. Despite that
>> it's heavily hapax-driven, it appears to do very well when judged by
>> error rate.
[Rob Hooft]
> Hm. There are so few fp/fn's relative to unsures (at least after 30
> messages initial training), that it wouldn't matter much (I think).
As I tried to explain later, the psychological impact of the Unsures isn't
attractive, though -- they remain bizarre to human eyes. When I got up
today, I got 6 new Unsure spam: human growth hormone, gay porn, life
insurance, mortgage rates, a msg that made no sense (empty except for a
Yahoo auto-generated sig), and Genuine Leather Jackets. It's not picking up
on general "this is advertising" clues, or even on general "this is gay
porn" clues. Indeed, "XXX" is still a hapax! This particular HGH spam will
never get through again, because training on it found 80(!) hapaxes unique to
it. It's not going to do much to stop other HGH spam, though -- this one
was especially chatty, and added words like 'forget', 'hair', 'lose', 'lost'
and 'anywhere' to the collection of (what are now, after training on it)
spam hapaxes -- just as previous HGH spam trained on didn't stop this one.
To my eyes, I had already told it about HGH spam, and I'm irked that it
showed me another one. Ditto gay porn, ditto life insurance, etc.
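
To make the hapax effect concrete, here is a minimal sketch in Python. It is
not the Spambayes code: the whitespace tokenizer, the function names, and the
two sample messages are all invented for illustration. It only shows how
training on a chatty message adds many tokens seen exactly once, while the one
token the two messages share ("hgh") stops being a hapax.

```python
from collections import Counter

def tokenize(text):
    # Hypothetical tokenizer: lowercase and split on whitespace.
    # A set is used so each token counts at most once per message.
    return set(text.lower().split())

def train(counts, text):
    """Train on one message; return the tokens that were new to the database."""
    tokens = tokenize(text)
    # Counter returns 0 for missing keys without inserting them.
    new = {t for t in tokens if counts[t] == 0}
    counts.update(tokens)
    return new

def hapaxes(counts):
    """Tokens that have been seen in exactly one trained message."""
    return {t for t, n in counts.items() if n == 1}

spam_counts = Counter()
train(spam_counts, "hgh human growth hormone buy now")
new = train(spam_counts, "hgh forget hair lose lost anywhere")
```

After the second call, `new` holds the five words unique to the chatty
message, and `hapaxes(spam_counts)` holds every token except "hgh". A third
HGH spam that shares none of those one-off words gets little help from either
training step.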
[on database growth as a function of # of msgs]
> Hm, it is more like a sqrt after more messages. See attached image which
> has a sqrt X axis. The fit fits the data even at the lowest end.
Cool! That was a dramatic graph indeed. Soon there will be no mysteries
remaining <wink>.
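
For anyone who wants to try the same kind of fit on their own database, here
is a rough sketch of a least-squares fit of size against sqrt(number of
messages). The function name and the data points are made up for
illustration; they are not Rob's measurements, and the model form
(a*sqrt(n) + b) is an assumption about what a "sqrt X axis" fit means.

```python
import math

def fit_sqrt(ns, sizes):
    """Least-squares fit of sizes ~ a*sqrt(n) + b; returns (a, b)."""
    xs = [math.sqrt(n) for n in ns]
    k = len(xs)
    mean_x = sum(xs) / k
    mean_y = sum(sizes) / k
    # Ordinary linear regression of sizes on sqrt(n).
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, sizes))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b

# Synthetic data generated from an exact sqrt law, purely for illustration.
ns = [1, 4, 9, 16, 25]
sizes = [3 * math.sqrt(n) + 2 for n in ns]
a, b = fit_sqrt(ns, sizes)
```

On this synthetic input the fit recovers the slope and intercept used to
generate it (about 3 and 2). With real data, plotting the residuals against
sqrt(n) would show whether the sqrt law holds at the low end too.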