[spambayes-dev] Reduced training test results
T. Alexander Popiel
popiel at wolfskeep.com
Mon Dec 29 12:51:22 EST 2003
In message: <3FEFF5F6.1090004 at hooft.net>
Rob Hooft <rob at hooft.net> writes:
>T. Alexander Popiel wrote:
>> Training on just those messages whose score isn't 0.00 or 1.00
>> (rounded) seems to be a huge win over training on everything.
>See the section "Train on Errors, Unsures, and non-obvious correct
>decisions" at http://www.entrian.com/sbwiki/TrainingIdeas
Hrm. I suppose that I ought to actually look at the wiki. ;-)
Is there any way for me to upload my plots to go along with any
discussion that I might add to the above page? I could just
reference them on my machine, but it seems better to keep the
wiki content all in one place.
>> Not so much because the accuracy is better (though accuracy
>> does seem to be improved by neglecting those messages that it's
>> already certain about), but because of a hugely reduced training
>> set (and thus database).
>Both are effects I can feel in practice!
FWIW, using this training style with my nightly retrains cut my
database size in half (from 21 MB to 10 MB). This is with a
4-month horizon, too, so the difference would likely be even
greater with a longer span.
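
For anyone who wants to try the same thing, here is a minimal sketch of the selection rule described above: skip any message whose score rounds to 0.00 or 1.00, and train on the rest. The names `is_obvious` and `select_for_training`, the tuple layout, and the sample scores are all hypothetical, made up for illustration. This is not the spambayes API, and it does not address whether messages are scored before or during a retrain.

```python
def is_obvious(score, ndigits=2):
    """True when the score rounds to exactly 0.00 (sure ham) or 1.00 (sure spam)."""
    return round(score, ndigits) in (0.0, 1.0)


def select_for_training(scored_messages):
    """Keep only the messages the classifier is not already certain about.

    scored_messages: iterable of (message_id, score, is_spam) tuples.
    This layout is an assumption for the sketch, not a spambayes structure.
    """
    return [m for m in scored_messages if not is_obvious(m[1])]


# Hypothetical sample data, not taken from any real test run.
sample = [
    ("msg-1", 0.001, False),  # rounds to 0.00: skipped
    ("msg-2", 0.340, False),  # unsure: kept for training
    ("msg-3", 0.999, True),   # rounds to 1.00: skipped
    ("msg-4", 0.020, True),   # spam scored as ham (an error): kept
]

selected = select_for_training(sample)
print([m[0] for m in selected])
```

Note that errors and unsures survive the filter automatically, since neither can round to the extreme that matches its correct label with certainty; only a confidently wrong score (e.g. spam at 0.00) would slip through, and that case would need a separate check against the true label.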