[spambayes-dev] Reduced training test results

Tim Peters tim.one at comcast.net
Sat Dec 27 00:52:37 EST 2003


[T. Alexander Popiel]
> Training on just those messages whose score isn't 0.00 or 1.00
> (rounded) seems to be a huge win over training on everything.
> Not so much because the accuracy is better (though accuracy
> does seem to be improved by neglecting those messages that it's
> already certain about),

I'm afraid TOE gives too much weight to systematically correlated tokens.
My experience with python.org mailing lists has pointed in that direction
since the start, but it's probably more general than that.  In a recap
nutshell, every piece of email coming from python.org has (with
mine_received_headers enabled) about a dozen tokens effectively saying "I
came from python.org".  I get several hundred ham like that every day, but
also a few spam per week.  Under TOE, the "python.org clues" get spamprobs
approaching 0, and a dozen very strong ham tokens are hard to overcome.  As a
result, it's *hard* for a spam leaking thru python.org to score as spam on
my end -- even under mistake-based training, where the spamprobs on
python.org-tokens are much higher than they'd be under TOE.
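
To make that concrete, here's a toy chi-combining sketch -- in the spirit
of the classifier's scoring, not a literal copy of it, and the spamprobs
are made up -- showing a dozen near-0 header clues swamping a handful of
genuinely spammy content tokens:

    from math import exp, log

    def chi2Q(x2, v):
        # P(chisq with v (even) degrees of freedom >= x2).
        m = x2 / 2.0
        s = term = exp(-m)
        for i in range(1, v // 2):
            term *= m / i
            s += term
        return min(s, 1.0)

    def score(probs):
        # Simplified chi-combining:  S is near 1 when the spamprobs look
        # spammy, H is near 1 when they look hammy.
        n = len(probs)
        S = chi2Q(-2.0 * sum(log(p) for p in probs), 2 * n)
        H = chi2Q(-2.0 * sum(log(1.0 - p) for p in probs), 2 * n)
        return (S - H + 1.0) / 2.0

    # A dozen "I came from python.org" clues driven near 0 under TOE,
    # plus six spammy content tokens:  the combined score lands well
    # below a 0.9 spam cutoff, so the message won't be called spam.
    print(score([0.01] * 12 + [0.95] * 6))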

I expect most (maybe all) of the developers here have similar long-term
sources of ham, feeding them a daily stream of correlated tokens that
effectively identify the source.

An irony is that I don't need those python.org tokens:  the *content* of
those msgs is solidly hammy even without them.  Maybe we should ignore our
strongest clues <0.5 wink>.

> but because of a hugely reduced training set (and thus database).
> Specifically, training on everything yielded a database with 70,000
> messages, while training only on the non-extreme put only about
> 3,500 messages into the database.  Unfortunately, I don't have firm
> numbers on token counts.

That's OK.  It was rigorously established before that the # of tokens either
does or doesn't go up with the square root, or some other function, of the
message count <wink>.

> Also of significant interest is that the classifier doesn't seem
> to decay as badly over time.  With training on everything, the
> unsure rate in particular (and fn to a much lesser extent) goes
> up significantly after about 200 days worth of traffic,

That's peculiar.  Did you try this with different starting dates, and find
that "about 200 days" was invariant across starting dates -- or did you try
a single starting date, and note that something funny happened about 200
days after that single starting date?  I think the latter, in which case
it's natural to speculate that something significant changed in your ham
and/or spam mix around then.

Thanks for the report, Alex!  Good work.



