[Spambayes] Effects of ham to spam ratio

T. Alexander Popiel popiel@wolfskeep.com
Tue, 08 Oct 2002 15:58:37 -0700


In message:  <BIEJKCLHCIOIHAGOKOLHGEJMDLAA.tim.one@comcast.net>
             Tim Peters <tim.one@comcast.net> writes:
>
>the only semi-automated way to extract the 4 error rates (fp/fn when
>certain/uncertain) is to set nbuckets to 4 and stare at the little
>histograms.

I'll see if I can get something to read those histograms for me,
when I start doing the central limit testing. ;-)
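A minimal sketch of what such a reader would tally, given per-message
scores (the cutoff values and the function name are my own illustrative
assumptions, not actual spambayes settings or code):

```python
# Hypothetical sketch: count the 4 error rates Tim mentions
# (fp/fn when certain/uncertain) from raw message scores.
# The cutoffs below are assumed values for illustration only.

HAM_CUTOFF = 0.20   # scores below this: classified as certain ham
SPAM_CUTOFF = 0.90  # scores at/above this: classified as certain spam

def error_counts(ham_scores, spam_scores):
    """Tally false positives/negatives, split by certainty.

    A ham message scored as certain spam is a certain fp; one that
    lands between the cutoffs is an uncertain fp, and symmetrically
    for false negatives on the spam side.
    """
    fp_certain = sum(1 for s in ham_scores if s >= SPAM_CUTOFF)
    fp_uncertain = sum(1 for s in ham_scores
                       if HAM_CUTOFF <= s < SPAM_CUTOFF)
    fn_certain = sum(1 for s in spam_scores if s < HAM_CUTOFF)
    fn_uncertain = sum(1 for s in spam_scores
                       if HAM_CUTOFF <= s < SPAM_CUTOFF)
    return fp_certain, fp_uncertain, fn_certain, fn_uncertain
```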

>> I again used timcv.py as my test driver, this time with 200
>> messages in each ham/spam set.
>
>How many sets (-n10, -n5, ...?).  Looks like 5.

Yeah, I was only using 5 sets, even though I have 10 available.  Doh!

>> There are several interesting things here:
>>
>> 1. The false positive rate remains insignificant throughout.
>> 2. The false negative rate drops significantly as the ham:spam
>>    ratio goes down.  The more spam you have in your mailfeed,
>>    the better this whole thing works.
>
>The reason isn't clear, though:  it may well have less to do with the ratio
>than with the absolute quantity of spam trained on.  If there's sufficient
>variety in your spam, it could simply be that 200 is way too few to get a
>representative sampling of the diversity your spam, umm, enjoys <wink>.

Well, given my prior experiment last Friday(?) on training set size,
which showed virtually no improvement in spam recognition as the training
set grew across the range I'm dealing with here, I don't think it's just
quantity that's the cause.  I probably should have mentioned those results
again.  They're still available at:

  http://www.wolfskeep.com/~popiel/spambayes/trainsize

I've also put an index of my experiments at:

  http://www.wolfskeep.com/~popiel/spambayes

>> 3. The ham:spam ratio affects the spam sdev much more than the
>>    ham sdev.
>
>Which is more reason to be suspicious:  sdev is a measure of how wild the
>data is.  If the sdev gets steady as the absolute count increases, it means
>the data is "settling down".  Your spam sdev goes up by about 0.50 in each
>column, with no sign of settling down "to the left", which suggests that
>even at the 50-200 extreme it's *still* finding plenty of new stuff in the
>spam.

True.  Hrm.

>Do you have a lot of Asian spam?  The gimmicks we've got for that ("skip"
>and "8bit%" meta-tokens) learn slowly, and that "skip" learns at all here is
>just a lucky accident.

Nope.  No Asian spam at all.  My spam is mostly in English, with a
fair amount of German porn spam (I have _no_ idea how I got onto
that list) and one or two spams in Spanish or Italian (I'm not sure
which).

>> 4. Tim's k value (mean separation divided by sum of standard
>>    deviations) is best with slightly less ham than spam (at 2:3),
>>    which happens to be about the same ratio as in my real mailfeed.
>>
>> It would be very interesting to find out if the best ham:spam
>> ratio for k (#4 above) is constant, or if it's actually tied to
>> the ratio in the real mail feed from which the training data is
>> taken.  This may be hard to measure for people who are using
>> corpora augmented from several sources.
>
>It would be better <wink> to get independent results from the same kind of
>test but run with more data.  I know that, for example, in my data, I have
>to train on several thousand spam before the improvement in spam
>identification slows to a crawl.

I'll rerun using all 10 sets instead of just 5. *blush*
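For concreteness, here's the k value as defined above (mean separation
divided by the sum of the standard deviations) as a minimal sketch; the
score lists are placeholders for whatever the test driver reports:

```python
from statistics import mean, stdev

def k_value(ham_scores, spam_scores):
    """Tim's k: separation of the ham and spam score means,
    divided by the sum of their standard deviations.  Bigger k
    means the two score populations are better separated."""
    separation = abs(mean(spam_scores) - mean(ham_scores))
    return separation / (stdev(ham_scores) + stdev(spam_scores))
```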

- Alex