[Spambayes] mining dates?
Tim Peters
tim.one@comcast.net
Tue, 01 Oct 2002 00:03:17 -0400
[Tim]
>> 2. Greg Ward suggested two Date things SpamAssassin looks for:
>>
>> SPAM: * 1.6 -- Invalid Date: header (not RFC 2822)
[Neil Schemenauer]
> Tried that. It didn't help my error rate so I mercilessly killed it.
Hmm. You generally chop off the lines revealing how large a test you're
running, but from your total error rates in the last report:
    total unique fp went from 5 to 5 tied
    mean fp % went from 0.25 to 0.25 tied
    total unique fn went from 15 to 15 tied
    mean fn % went from 0.75 to 0.75 tied
it seems a safe bet that you're predicting against 200 messages per run. In
that case, the smallest non-zero *change* in a one-run error rate you could
possibly see is 0.5% (1 of 200 msgs), which essentially *is* your overall
error rate. In other words, like me, you've reached the point where your
corpus can no longer support measuring improvements reliably -- even if a
solid but modest improvement were to be made, it's quite likely you couldn't
measure it.
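
To make the granularity argument concrete, here is a minimal sketch of the arithmetic. The function name smallest_rate_step is made up for illustration and is not part of the spambayes code; the 200-message run size is the guess made above, not a number Neil reported.

```python
def smallest_rate_step(msgs_per_run):
    """Smallest non-zero change in a one-run error rate, in percent.

    One message flipping between right and wrong moves the rate by
    exactly 1/msgs_per_run, so nothing finer can be observed.
    """
    return 100.0 / msgs_per_run


# Assumption: 200 messages predicted against per run (inferred, not reported).
step = smallest_rate_step(200)

# Mean rates taken from the quoted report.
for name, mean_rate in [("fp", 0.25), ("fn", 0.75)]:
    ratio = step / mean_rate
    print(f"{name}: mean {mean_rate}%, one message = {step}% "
          f"({ratio:.2f}x the mean rate)")

# A run ten times larger resolves changes ten times smaller.
print(f"2000 msgs per run: one message = {smallest_rate_step(2000)}%")
```

With a step that is as large as the rates themselves, a one-message wobble in a run swamps any modest real improvement.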
That leaves us staring at ham & spam means & sdevs, which are still good
indicators of whether a change moves "in a good direction", but that isn't as
exciting as watching error rates plummet. Moving to a larger corpus would
help make your life more interesting again: sign up for more mailing lists
<wink>.