[Spambayes] Outlook plugin - training

Sun Nov 10 12:28:44 2002

---------------------- multipart/mixed attachment
Tim Peters wrote:
> [Rob Hooft]
> 
>>I just added a testdriver to CVS that simulates your behaviour as I
>>understand it: It will train on the first 30 messages,
> 
> 
> I trained on 1 of each at the start.  If I were to do it over, I'd start
> with an empty database <wink>.

This is easy enough to change, but I left it at 30 for now.

> Since I'm doing this real-time on my live email, I've been training "on the
> worst" (farthest away from correct) msg that arrives in a batch, then
> rescoring all the ones that arrived in the batch, then training the worst
> remaining, ... until all new ham is below ham_cutoff and all new spam above
> spam_cutoff.  I don't know that it matters, just being clear(er).  As things
> turned out, this worst-at-a-time training never managed to push one of the
> remaining mistakes/unsures into the correct category, *except* for cases
> where I got more than one copy of a spam from different accounts at the same
> time.  Then it always pushed the copies into scoring near 1.0, since the
> hapaxes in the training copy are abundant.

But I'm doing exactly the same, except that my batch size is always 1 ;-)
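
For readers following along, the worst-at-a-time procedure quoted above can be sketched roughly as below. This is only an illustration under stated assumptions: ToyClassifier is a hypothetical stand-in (not the spambayes classifier), the cutoff values 0.20 and 0.90 are assumed, and a message is treated as settled once it has been trained on. A batch of size 1 reduces to plain mistake-and-unsure training on each message as it arrives.

```python
from collections import Counter

HAM_CUTOFF = 0.20   # assumed value, not taken from this thread
SPAM_CUTOFF = 0.90  # assumed value, not taken from this thread


class ToyClassifier:
    """Hypothetical stand-in classifier: averages per-token spam ratios."""

    def __init__(self):
        self.spam = Counter()
        self.ham = Counter()

    def learn(self, tokens, is_spam):
        # Count each distinct token once per trained message.
        (self.spam if is_spam else self.ham).update(set(tokens))

    def score(self, tokens):
        probs = []
        for tok in set(tokens):
            s, h = self.spam[tok], self.ham[tok]
            if s + h:
                probs.append(s / (s + h))
        # Messages with no known tokens score as completely unsure.
        return sum(probs) / len(probs) if probs else 0.5


def wrongness(score, is_spam):
    """Distance of the score from the correct end of the [0, 1] scale."""
    return (1.0 - score) if is_spam else score


def is_settled(score, is_spam):
    """True once ham is below ham_cutoff or spam is above spam_cutoff."""
    return score > SPAM_CUTOFF if is_spam else score < HAM_CUTOFF


def train_worst_first(clf, batch):
    """Train on the worst message, rescore the rest, and repeat.

    batch is a list of (tokens, is_spam) pairs; returns how many
    messages were trained on.
    """
    pending = list(batch)
    trained = 0
    while True:
        # Rescore everything still pending with the current database.
        wrong = []
        for i, (tokens, is_spam) in enumerate(pending):
            s = clf.score(tokens)
            if not is_settled(s, is_spam):
                wrong.append((wrongness(s, is_spam), i))
        if not wrong:
            return trained
        _, worst = max(wrong)
        tokens, is_spam = pending.pop(worst)
        clf.learn(tokens, is_spam)
        trained += 1
```

With two identical spam copies in one batch, train_worst_first trains on one copy and the other then scores 1.0 in this toy model, which mirrors the duplicate-copy effect described in the quoted text.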

>>It may not even be very realistic to train on fp's, as I think in my
>>private E-mail I won't even check the spam folder very thoroughly at all.

> But I will (and do), and my primary interest here is to see how bad things
> can get if a user takes mistake-based training to an extreme.  Despite that
> it's heavily hapax-driven, it appears to do very well when judged by error
> rate.

Hm. There are so few fp/fn's relative to unsures (at least after the initial
30-message training) that it wouldn't matter much (I think).

>>  * The database growth doesn't decay with time after a while;
>>    it can be described as:
>>       nwords = 9200 + 1.6 * nmessages
>>    or alternatively:
>>       nwords = 5700 + 40 * ntrained
>>    ..as can be seen in the attached png's
> 
> 
> I expect that's mostly because there are still (relatively) few total msgs
> trained on.
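
The two linear fits quoted above are easy to evaluate side by side. The sketch below just encodes them as given; the function names are mine, and implied_ntrained is nothing more than the arithmetic consequence of setting the two fits equal, which only means something over the range where both fits actually hold.

```python
def nwords_from_messages(nmessages):
    """Quoted fit: database size as a function of messages seen."""
    return 9200 + 1.6 * nmessages


def nwords_from_trained(ntrained):
    """Quoted fit: database size as a function of messages trained on."""
    return 5700 + 40 * ntrained


def implied_ntrained(nmessages):
    """Solve 5700 + 40*t = 9200 + 1.6*n for t (pure arithmetic,
    not a measurement reported in the thread)."""
    return (3500 + 1.6 * nmessages) / 40
```

If both fits held at once, the slope 1.6 / 40 = 0.04 would correspond to roughly one trained message per 25 messages seen, on top of a fixed offset.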

Hm, with more messages the growth looks more like a sqrt. See the attached
image, which has a sqrt X axis. The fit matches the data even at the lowest end.
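
A square-root fit of this kind can be sketched as an ordinary least-squares line in sqrt(n). The model form nwords = a + b * sqrt(n) is an assumption (the exact fit function is not stated here), and the numbers in the usage note are synthetic, not the data behind the attached plot.

```python
import math


def fit_sqrt(ns, words):
    """Least-squares fit of words = a + b * sqrt(n); returns (a, b).

    Assumed model form; needs at least two distinct values of n.
    """
    xs = [math.sqrt(n) for n in ns]
    m = len(xs)
    mean_x = sum(xs) / m
    mean_y = sum(words) / m
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, words))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b
```

For example, made-up points generated from 1000 + 50 * sqrt(n), such as fit_sqrt([100, 400, 900, 1600], [1500, 2000, 2500, 3000]), recover a close to 1000 and b close to 50. Plotting against a sqrt X axis, as in the attachment, turns such a curve into a straight line.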

Regards,

Rob

-- 
Rob W.W. Hooft  ||  rob@hooft.net  ||  http://www.hooft.net/people/rob/

---------------------- multipart/mixed attachment
A non-text attachment was scrubbed...
Name: words3.png
Type: image/png
Size: 13675 bytes
Desc: not available
Url : http://mail.python.org/pipermail/spambayes/attachments/20021110/b5905d0f/words3-0001.png

---------------------- multipart/mixed attachment--