[Spambayes] Outlook plugin - training
Rob Hooft
rob@hooft.net
Sun Nov 10 12:28:44 2002
Tim Peters wrote:
> [Rob Hooft]
>
>>I just added a testdriver to CVS that simulates your behaviour as I
>>understand it: It will train on the first 30 messages,
>
>
> I trained on 1 of each at the start. If I were to do it over, I'd start
> with an empty database <wink>.
This is easy enough to change, but I left it at 30 for now.
> Since I'm doing this real-time on my live email, I've been training "on the
> worst" (farthest away from correct) msg that arrives in a batch, then
> rescoring all the ones that arrived in the batch, then training the worst
> remaining, ... until all new ham is below ham_cutoff and all new spam above
> spam_cutoff. I don't know that it matters, just being clear(er). As things
> turned out, this worst-at-a-time training never managed to push one of the
> remaining mistakes/unsures into the correct category, *except* for cases
> where I got more than one copy of a spam from different accounts at the same
> time. Then it always pushed the copies into scoring near 1.0, since the
> hapaxes in the training copy are abundant.
But I'm doing exactly the same, except that my batch size is always 1 ;-)
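
For anyone who wants to play with the worst-at-a-time procedure described
above, here is a minimal sketch of it. It is not the CVS test driver and it
does not use the Spambayes classifier: ToyClassifier, its smoothing
constants, the cutoff values and train_worst_first are all hypothetical
names and numbers made up for illustration. Dropping a message from the
batch once it has been trained on is also an assumption, made so the loop
always terminates.

```python
from collections import Counter

# Assumed cutoffs for illustration only; not values taken from this thread.
HAM_CUTOFF = 0.2
SPAM_CUTOFF = 0.9


class ToyClassifier:
    """Hypothetical stand-in for a real classifier; NOT the Spambayes API."""

    def __init__(self):
        self.spam = Counter()
        self.ham = Counter()

    def train(self, tokens, is_spam):
        # Count each distinct token once per trained message.
        (self.spam if is_spam else self.ham).update(set(tokens))

    def score(self, tokens):
        # Average a smoothed per-token spam ratio; unseen tokens score 0.5.
        probs = []
        for tok in set(tokens):
            s, h = self.spam[tok], self.ham[tok]
            probs.append((s + 0.05) / (s + h + 0.1))
        return sum(probs) / len(probs) if probs else 0.5


def error(clf, tokens, is_spam):
    """Distance from the correct side of the cutoff; 0.0 means correct."""
    score = clf.score(tokens)
    if is_spam:
        return max(0.0, SPAM_CUTOFF - score)
    return max(0.0, score - HAM_CUTOFF)


def train_worst_first(clf, batch):
    """Train on the worst message, rescore the rest, and repeat.

    batch is a list of (tokens, is_spam) pairs. Stops when every remaining
    message is on the correct side of its cutoff. Returns the number of
    messages trained on.
    """
    remaining = list(batch)
    trained = 0
    while remaining:
        # Rescoring happens here: errors are recomputed on every pass.
        worst = max(remaining, key=lambda m: error(clf, *m))
        if error(clf, *worst) == 0.0:
            break
        clf.train(*worst)
        remaining.remove(worst)
        trained += 1
    return trained
```

Calling train_worst_first with one-message batches gives the batch-size-1
variant. With a batch holding two copies of the same spam, training on one
copy is enough in this toy model to push the other above SPAM_CUTOFF, since
every token of the copy has been seen only in spam.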
>>It may not even be very realistic to train on fp's, as I think in my
>>private E-mail I won't even check the spam folder very thoroughly at all.
> But I will (and do), and my primary interest here is to see how bad things
> can get if a user takes mistake-based training to an extreme. Despite that
> it's heavily hapax-driven, it appears to do very well when judged by error
> rate.
Hm. There are so few fp/fn's relative to unsures (at least after 30
messages of initial training) that it wouldn't matter much (I think).
>> * The database growth doesn't decay with time after a while;
>> it can be described as:
>> nwords = 9200 + 1.6 * nmessages
>> or alternatively:
>> nwords = 5700 + 40 * ntrained
>> ..as can be seen in the attached png's
>
>
> I expect that's mostly because there are still (relatively) few total msgs
> trained on.
Hm, with more messages the growth looks more like a sqrt. See the attached
image, which has a sqrt X axis; the fit matches the data even at the lowest end.
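
The two linear descriptions quoted above can be compared with a few lines
of arithmetic. This is only a sketch: the coefficients are the ones from
the quoted text, the function names are made up here, and setting the two
expressions equal assumes both fits describe the same run, which they only
do approximately. No sqrt model is included because no coefficients for it
are given in this message.

```python
def nwords_from_messages(nmessages):
    """Database size as a function of messages seen (quoted linear fit)."""
    return 9200 + 1.6 * nmessages


def nwords_from_trained(ntrained):
    """Database size as a function of messages trained on (quoted linear fit)."""
    return 5700 + 40 * ntrained


def implied_ntrained(nmessages):
    """Number of trained messages at which both fits give the same nwords."""
    return (nwords_from_messages(nmessages) - 5700) / 40


if __name__ == "__main__":
    for n in (1000, 5000, 10000):
        print(n, nwords_from_messages(n), implied_ntrained(n))
```

The ratio of the slopes, 1.6 / 40 = 0.04, suggests that in the linear
regime roughly one message in 25 ends up being trained on.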
Regards,
Rob
--
Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/
---------------------- multipart/mixed attachment
A non-text attachment was scrubbed...
Name: words3.png
Type: image/png
Size: 13675 bytes
Desc: not available
Url : http://mail.python.org/pipermail/spambayes/attachments/20021110/b5905d0f/words3-0001.png