[Spambayes] mining dates?

Tim Peters tim.one@comcast.net
Tue, 01 Oct 2002 00:03:17 -0400


[Tim]
>> 2. Greg Ward suggested two Date things SpamAssassin looks for:
>>
>> SPAM: *  1.6 -- Invalid Date: header (not RFC 2822)

[Neil Schemenauer]
> Tried that.  It didn't help my error rate so I mercilessly killed it.

Hmm.  You generally chop off the lines revealing how large a test you're
running, but from your total error rates in the last report:

    total unique fp went from 5 to 5 tied
    mean fp % went from 0.25 to 0.25 tied

    total unique fn went from 15 to 15 tied
    mean fn % went from 0.75 to 0.75 tied

it seems a safe bet that you're predicting against 200 messages per run.  In
that case, the smallest non-zero *change* in a one-run error rate you could
possibly see is 0.5% (1 of 200 msgs), which essentially *is* your overall
error rate.  In other words, like me, you've reached the point where your
corpus can no longer support measuring improvements reliably --  even if a
solid but modest improvement were to be made, it's quite likely you couldn't
measure it.
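
To make the arithmetic concrete, here is a minimal sketch of how the
resolution of a one-run error rate depends on how many messages you predict
against.  The function name and the sample sizes other than 200 are made up
for illustration; they're not from anyone's test driver.

```python
def error_rate_resolution(n_messages):
    """Smallest non-zero change in a one-run error rate, in percent.

    Hypothetical helper: one message flipping between right and wrong
    moves the rate by exactly 1/n_messages.
    """
    if n_messages <= 0:
        raise ValueError("need at least one message per run")
    return 100.0 / n_messages


if __name__ == "__main__":
    # 200 is the per-run size guessed at above; the others are illustrative.
    for n in (200, 1000, 5000):
        print("%5d msgs per run -> resolution %.3f%%"
              % (n, error_rate_resolution(n)))
```

With 200 messages per run the step size is 0.5%, so a change that moves the
error rate by less than that can't show up in a single run.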

That leaves us staring at ham & spam means & sdevs, which are still good
indicators of whether a change moves "in a good direction", but that isn't as
exciting as watching error rates plummet.  Moving to a larger corpus would
help make your life more interesting again:  sign up for more mailing lists
<wink>.