[spambayes-dev] RE: [Spambayes] How low can you go?

Fri Dec 26 21:13:06 EST 2003

[Seth Goodman]
>> This gives me a roughly 1:5 ham/spam corpus, instead of roughly
>> even, but that's the mail stream that SpamBayes sees.

[T. Alexander Popiel]
> This is the stuff I'd tend to use for the testing, as opposed to your
> equal-sized training sets.

If Seth is going to test incremental training regimes, then yes, his entire
email stream (well, the parts of it scored by spambayes -- he said he uses
Outlook rules to exempt a large part of it from getting scored at all)
should be included.

If he wants to do cross-validation testing, he should still bust it all up
into the same number of ham sets as spam sets.  timcv's "ham-keep" and
"spam-keep" options can then be used to select random equal-sized (or
non-equal-sized) subsets dynamically.
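
To make the idea concrete, here is a rough sketch of the splitting and
subset selection described above.  It is not timcv's actual code: the
function names, the round-robin dealing, the set count of 5, and the toy
message lists are all assumptions made up for illustration.

```python
import random

def split_into_sets(msgs, nsets, seed=0):
    """Shuffle msgs and deal them round-robin into nsets lists (hypothetical helper)."""
    rng = random.Random(seed)
    shuffled = list(msgs)
    rng.shuffle(shuffled)
    return [shuffled[i::nsets] for i in range(nsets)]

def keep_random(msg_set, keep, rng):
    """Return a random subset of at most `keep` messages from one set."""
    if keep is None or keep >= len(msg_set):
        return list(msg_set)
    return rng.sample(msg_set, keep)

# Toy corpus with a roughly 1:5 ham/spam ratio (made-up data).
ham = ["ham%d" % i for i in range(100)]
spam = ["spam%d" % i for i in range(500)]

nsets = 5  # assumed; the same number of sets for ham and for spam
ham_sets = split_into_sets(ham, nsets)
spam_sets = split_into_sets(spam, nsets)

# Pick equal-sized random subsets from each set at test time.
rng = random.Random(42)
ham_used = [keep_random(s, 20, rng) for s in ham_sets]
spam_used = [keep_random(s, 20, rng) for s in spam_sets]
```

Passing different keep values for ham and spam gives the non-equal-sized
case; the full sets stay on disk untouched either way.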

In your (Alex's) recent "nonedge" incremental training experiment, it looks
like your training data grew to about a 5.5::1 spam::ham ratio after 400
days.  I know my personal classifiers start acting flaky whenever I've let
them get imbalanced by more than 2::1 in either direction.  So if I had your
data, I'd be curious to try variations that force better balance.  I have my
data, but it's less than a week old <wink>.  You have enough data that it
may well be more interesting to you to try variations including expiration.
The second derivative of your "Cumulative Trained Counts" ham training
curve appears slightly negative, but your spam training curve appears mostly
straight except for two points where it clearly gets steeper.  A hypothesis
is that your ham isn't changing much over time but your spam is, that the
weight of the old spam training data is making it harder to adjust to the
spam changes, and that this gets worse over time.  OTOH, with the spam::ham
training imbalance getting worse over time too, it may just be that the
classifier is getting flakier over time for that reason alone.
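
One possible "force better balance" variation, sketched under loud
assumptions: the 2::1 cap is just the rule of thumb mentioned above, random
dropping is only one of several ways to trim the larger class (an
expiration scheme would drop the oldest messages instead), and the function
names and toy counts are invented for the example.  Nothing here is
spambayes code.

```python
import random

def imbalance(nham, nspam):
    """Return the larger-to-smaller ratio of the two training counts."""
    lo, hi = sorted((nham, nspam))
    if lo == 0:
        return float("inf")
    return hi / lo

def rebalance(ham, spam, max_ratio=2.0, seed=0):
    """Randomly drop messages from the larger class until the ratio of
    the two training sets is at most max_ratio (hypothetical helper)."""
    rng = random.Random(seed)
    ham, spam = list(ham), list(spam)
    if len(spam) > max_ratio * len(ham):
        spam = rng.sample(spam, int(max_ratio * len(ham)))
    elif len(ham) > max_ratio * len(spam):
        ham = rng.sample(ham, int(max_ratio * len(spam)))
    return ham, spam

# Toy training data at a 5.5::1 spam::ham ratio (made-up counts).
ham = ["ham%d" % i for i in range(200)]
spam = ["spam%d" % i for i in range(1100)]

before = imbalance(len(ham), len(spam))
ham2, spam2 = rebalance(ham, spam)
after = imbalance(len(ham2), len(spam2))
```

Comparing error rates with and without the rebalancing step, on the same
incremental run, would be one way to tell the "imbalance" explanation
apart from the "old spam weighs too much" one.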