[Spambayes] Outlook plugin - training

Tim Peters tim.one@comcast.net
Mon Nov 11 01:17:46 2002


[Tim]
>> ... my primary interest here is to see how bad things can get if
>> a user takes mistake-based training to an extreme.  Despite that
>> it's heavily hapax-driven, it appears to do very well when judged by
>> error rate.

[Rob Hooft]
> Hm. There are so few fp/fn's relative to unsures (at least after 30
> messages initial training), that it wouldn't matter much (I think).

As I tried to explain later, the psychological impact of the Unsures isn't
attractive, though -- they remain bizarre to human eyes.  When I got up
today, I got 6 new Unsure spam:  human growth hormone, gay porn, life
insurance, mortgage rates, a msg that made no sense (empty except for a
Yahoo auto-generated sig), and Genuine Leather Jackets.  It's not picking up
on general "this is advertising" clues, or even on general "this is gay
porn" clues.  Indeed, "XXX" is still a hapax!  This particular HGH spam will
never get through again, because training on it found 80(!) hapaxes unique to
it.  It's not going to do much to stop other HGH spam, though -- this one
was especially chatty, and added words like 'forget', 'hair', 'lose', 'lost'
and 'anywhere' to the collection of (what are now, after training on it)
spam hapaxes -- just as previous HGH spam trained on didn't stop this one.
To my eyes, I had already told it about HGH spam, and I'm irked that it
showed me another one.  Ditto gay porn, ditto life insurance, etc.
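
To make the hapax effect concrete, here is a toy sketch in Python.  It is
not the Spambayes code:  the whitespace tokenizer, the ToyTrainer class and
the two example messages are all made up for illustration.  It only shows
how training on one chatty spam mostly adds words that no other trained
message contains.

```python
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase and split on whitespace.
    # The real Spambayes tokenizer is far more elaborate.
    return set(text.lower().split())


class ToyTrainer:
    """Counts, per word, how many trained messages contained it."""

    def __init__(self):
        self.spam_counts = Counter()
        self.ham_counts = Counter()

    def train(self, text, is_spam):
        counts = self.spam_counts if is_spam else self.ham_counts
        counts.update(tokenize(text))

    def hapaxes(self):
        # Words seen in exactly one trained message overall.
        total = self.spam_counts + self.ham_counts
        return {word for word, n in total.items() if n == 1}

    def new_hapaxes(self, text):
        # Words never seen in training: each becomes a hapax
        # if this message is trained on.
        return {word for word in tokenize(text)
                if word not in self.spam_counts
                and word not in self.ham_counts}


trainer = ToyTrainer()
trainer.train("lose weight with hgh today", is_spam=True)

second = "forget hair lost anywhere hgh"   # made-up second HGH spam
fresh = trainer.new_hapaxes(second)        # words unique to this message
trainer.train(second, is_spam=True)
remaining = trainer.hapaxes()
```

In this toy run only 'hgh' is shared between the two messages, so it is the
only word that stops being a hapax.  Everything else the second spam
contributed is evidence against that one message and little else.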


[on database growth as a function of # of msgs]
> Hm, it is more like a sqrt after more messages. See attached image which
> has a sqrt X axis. The fit fits the data even at the lowest end.

Cool!  That was a dramatic graph indeed.  Soon there will be no mysteries
remaining <wink>.
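
For anyone wanting to try the same kind of fit, a minimal sketch follows.
The numbers are synthetic, chosen to lie exactly on a square-root curve;
they are not the data behind the attached graph.  Forcing the fit through
the origin is also an assumption made here.

```python
import math


def fit_sqrt(msg_counts, db_sizes):
    """Least-squares fit of db_size ~ a * sqrt(n), forced through the origin."""
    num = sum(math.sqrt(n) * size for n, size in zip(msg_counts, db_sizes))
    den = sum(msg_counts)   # equals the sum of sqrt(n)**2
    return num / den


# Synthetic illustration only: sizes generated as 300 * sqrt(n).
msg_counts = [100, 400, 900, 1600]
db_sizes = [3000.0, 6000.0, 9000.0, 12000.0]

a = fit_sqrt(msg_counts, db_sizes)
```

Plotted against sqrt(n), data of this shape falls on a straight line of
slope a, which is what a sqrt X axis makes visible.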



