[Spambayes] Effects of ham to spam ratio
T. Alexander Popiel
popiel@wolfskeep.com
Tue, 08 Oct 2002 15:58:37 -0700
In message: <BIEJKCLHCIOIHAGOKOLHGEJMDLAA.tim.one@comcast.net>
Tim Peters <tim.one@comcast.net> writes:
>
>the only semi-automated way to extract the 4 error rates (fp/fn when
>certain/uncertain) is to set nbuckets to 4 and stare at the little
>histograms.
I'll see if I can get something to read those histograms for me,
when I start doing the central limit testing. ;-)
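Something like the following could do that tallying. This is only a sketch under assumptions: it takes per-bucket ham/spam counts as if they'd already been scraped from timcv.py's nbuckets=4 histograms, and it assumes a 0.5 spam cutoff with the two middle buckets treated as "uncertain" — the actual bucket boundaries and thresholds would need checking against the driver.

```python
# Hypothetical sketch: given per-bucket counts from the nbuckets=4
# histograms, compute the four error rates (fp/fn, certain/uncertain).
# Bucket layout ([0,.25), [.25,.5), [.5,.75), [.75,1]) and the 0.5
# spam cutoff are assumptions, not necessarily what timcv.py uses.

def error_rates(ham_buckets, spam_buckets):
    """Each argument is a list of 4 counts, lowest score bucket first."""
    n_ham = sum(ham_buckets)
    n_spam = sum(spam_buckets)
    return {
        # ham scoring >= 0.75: confidently misclassified as spam
        'fp_certain': ham_buckets[3] / n_ham,
        # ham scoring [0.5, 0.75): misclassified, but borderline
        'fp_uncertain': ham_buckets[2] / n_ham,
        # spam scoring < 0.25: confidently misclassified as ham
        'fn_certain': spam_buckets[0] / n_spam,
        # spam scoring [0.25, 0.5): misclassified, but borderline
        'fn_uncertain': spam_buckets[1] / n_spam,
    }

# Made-up counts for a 200-ham / 200-spam run:
rates = error_rates([190, 6, 3, 1], [2, 4, 8, 186])
```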
>> I again used timcv.py as my test driver, this time with 200
>> messages in each ham/spam set.
>
>How many sets (-n10, -n5, ...?). Looks like 5.
Yeah, I was only using 5 sets, even though I have 10 available. Doh!
>> There are several interesting things here:
>>
>> 1. The false positive rate remains insignificant throughout.
>> 2. The false negative rate drops significantly as the ham:spam
>> ratio goes down. The more spam you have in your mailfeed,
>> the better this whole thing works.
>
>The reason isn't clear, though: it may well have less to do with the ratio
>than with the absolute quantity of spam trained on. If there's sufficient
>variety in your spam, it could simply be that 200 is way too few to get a
>representative sampling of the diversity your spam, umm, enjoys <wink>.
Well, given my prior experiment last Friday(?) on training set size,
which showed virtually no improvement in spam recognition as the training
set grew across the range I'm dealing with here, I don't think it's just
quantity that's the cause. I probably should have mentioned those results
again. They're still available at:
http://www.wolfskeep.com/~popiel/spambayes/trainsize
I've also put an index of my experiments at:
http://www.wolfskeep.com/~popiel/spambayes
>> 3. The ham:spam ratio affects the spam sdev much more than the
>> ham sdev.
>
>Which is more reason to be suspicious: sdev is a measure of how wild the
>data is. If the sdev gets steady as the absolute count increases, it means
>the data is "settling down". Your spam sdev goes up by about 0.50 in each
>column, with no sign of settling down "to the left", which suggests that
>even at the 50-200 extreme it's *still* finding plenty of new stuff in the
>spam.
True. Hrm.
>Do you have a lot of Asian spam? The gimmicks we've got for that ("skip"
>and "8bit%" meta-tokens) learn slowly, and that "skip" learns at all here is
>just a lucky accident.
Nope. No Asian spam at all. My spam is mostly in English, with a
fair amount of German porn spam (I have _no_ idea how I got onto
that list) and one or two spams in Spanish or Italian (I'm not sure
which).
>> 4. Tim's k value (mean separation divided by sum of standard
>> deviations) is best with slightly less ham than spam (at 2:3),
>> which happens to be about the same ratio as in my real mailfeed.
>>
>> It would be very interesting to find out if the best ham:spam
>> ratio for k (#4 above) is constant, or if it's actually tied to
>> the ratio in the real mail feed from which the training data is
>> taken. This may be hard to measure for people who are using
>> corpora augmented from several sources.
>
>It would be better <wink> to get independent results from the same kind of
>test but run with more data. I know that, for example, in my data, I have
>to train on several thousand spam before the improvement in spam
>identification slows to a crawl.
I'll rerun using all 10 sets instead of just 5. *blush*
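For anyone reproducing the k numbers above, the statistic is just the separation of the ham and spam score means divided by the sum of their standard deviations. A minimal sketch (the sample scores are made up for illustration; whether the real drivers use population or sample sdev is an assumption here):

```python
# k = (mean spam score - mean ham score) / (ham sdev + spam sdev)
from statistics import mean, pstdev

def k_value(ham_scores, spam_scores):
    """Larger k means the ham and spam score distributions are
    better separated relative to how spread out they are."""
    separation = mean(spam_scores) - mean(ham_scores)
    return separation / (pstdev(ham_scores) + pstdev(spam_scores))

# Toy scores: ham clustered near 0, spam near 1.
k = k_value([0.0, 0.2], [0.8, 1.0])
```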
- Alex