[Spambayes] Outlook plugin - training

Fri Nov 8 00:06:27 2002

In message:  <BIEJKCLHCIOIHAGOKOLHMEDFDOAA.tim@zope.com>
             "Tim Peters" <tim@zope.com> writes:
>[Anthony Baxter]
>> Note that "random sample" is not as trivial as all that, either - if
>> you have a very high ham:spam ratio in your training DB, your accuracy
>> will suffer (see the tests from Alex, myself and others).
>
>I still need to try to make sense of those tests.  A real complication is
>that more than one thing changes when trying to test ratios:  it's not just
>the ratio that changes, it's the absolute number of each trained on too.

True.

>For example, (a) train on 5000 ham and 1000 spam; or, (b) train on 50000 ham
>and 10000 spam.  The ratios are identical.  Do we expect the error rates to
>be identical too?  I don't, but haven't tried it.

I have tried this, and the effects of ratio were diminished
as the training set size increased.  For details, see
http://www.wolfskeep.com/~popiel/spambayes/ratio2 .  The
tests were done with gary-combining, not chi-square, so I
really ought to rerun them.

>I expect the latter would do better than the former, despite the identical
>ratios, simply because more msgs allow better spamprob estimates.

It depended on what the ratio in question was... for 1:4
ham:spam, increased training set size hurt instead of helped,
in the ranges that I was able to test.  For 1:1, increased
training helped instead of hurt.

>Something missing in "the ratio tests" is a rationale (even an
>after-the-fact one) for believing there's some aspect of the system that's
>sensitive to the ratio.  The combining method certainly is not, and the
>spamprob estimation (update_probabilities()) deliberately works with
>percentages instead of raw counts so that the ham:spam training ratio
>has no direct effect on the spamprobs calculated.
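
To make the percentages point concrete, here is a minimal sketch of a
percentage-based spamprob estimate. It is a simplification and not the
actual update_probabilities() code: the helper name raw_spamprob and all
the counts are made up for illustration, and any small-count adjustment
is left out.

```python
def raw_spamprob(spamcount, hamcount, nspam, nham):
    """Simplified percentage-based estimate (hypothetical helper).

    spamcount/hamcount: number of trained spam/ham msgs containing the word.
    nspam/nham: total number of trained spam/ham msgs.
    """
    spamratio = spamcount / nspam   # fraction of spam containing the word
    hamratio = hamcount / nham      # fraction of ham containing the word
    return spamratio / (hamratio + spamratio)

# A word seen in 10% of spam and 1% of ham, under two training ratios.
skewed = raw_spamprob(spamcount=100, hamcount=50, nspam=1000, nham=5000)
balanced = raw_spamprob(spamcount=100, hamcount=10, nspam=1000, nham=1000)
print(skewed, balanced)
```

Both calls see the same per-class percentages, so they produce the same
estimate even though one training set is 5:1 ham:spam and the other 1:1.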

Eh, I have a perfectly good rationale for believing that
something is sensitive to the ratio: the tests I've run
show such a sensitivity.  What's missing is a theory on
_why_ there's a sensitivity. ;-)

I don't think the following theory is perfectly phrased, but
it seems plausible to me:

Perhaps the number of topics discussed in ham is greater
than that in spam.  Thus, the average percentage of ham
messages containing a particular significant ham word is
systematically lower than the average percentage of spam
messages containing a particular significant spam word.
As the training set size increases, the percentage difference
becomes more consistent and pronounced.  Since we're then
combining the percentages, we systematically skew slightly
due to the differing averages.
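
A toy model of the first half of this theory, as a rough sketch only.
The assumptions are loud ones: every message is about exactly one
topic, every topic has exactly one marker word, topics are equally
likely, and the topic counts (50 for ham, 5 for spam) are invented.
None of this comes from the actual test data.

```python
import random

def mean_doc_frequency(n_topics, n_msgs, rng):
    """Average fraction of messages containing a given marker word,
    taken over the marker words that appear at least once in the sample.
    Toy model: one topic per message, one marker word per topic."""
    counts = [0] * n_topics
    for _ in range(n_msgs):
        counts[rng.randrange(n_topics)] += 1
    seen = [c for c in counts if c > 0]
    return sum(c / n_msgs for c in seen) / len(seen)

rng = random.Random(0)  # fixed seed so runs are repeatable
ham_freq = mean_doc_frequency(n_topics=50, n_msgs=5000, rng=rng)
spam_freq = mean_doc_frequency(n_topics=5, n_msgs=5000, rng=rng)
print(ham_freq, spam_freq)
```

In this model the average works out to 1 / (number of topics seen), so
the many-topic class always gets the lower per-word percentage.  Whether
real ham and spam behave this way is exactly the open question.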

Changing the ratio of ham to spam has the effect of changing
the number of topics discussed, particularly when the training
set size is small and random chance can exclude all instances
of a given topic.  Balancing the number of topics removes the
skew in the probabilities.  As training set size increases,
adjusting the ratio has less effect, because it is less
likely to eliminate topics of discussion.

I think that would account for my data.

>The total # of spam training msgs does limit how high a spamprob can get,
>and the total # of ham training msgs limits how low.  The *suspicion* I had
>running my large c.l.py test is that it wasn't the ratio that mattered so
>much as the absolute number, and that the error rates didn't "settle down"
>to the 4th digit until I got near 10,000 spam total.
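
One way to see how the message totals can cap a spamprob, sketched
under assumptions: the estimate is taken to be a percentage-based
probability pulled toward a prior by a Robinson-style adjustment.  The
constants S and X and both function names are illustrative choices,
not values or names confirmed anywhere in this thread.

```python
S = 0.45  # assumed strength given to the prior
X = 0.5   # assumed spamprob for a word never seen before

def adjusted_spamprob(spamcount, hamcount, nspam, nham):
    """Percentage-based estimate, shrunk toward X when evidence is thin."""
    spamratio = spamcount / nspam
    hamratio = hamcount / nham
    prob = spamratio / (hamratio + spamratio)
    n = spamcount + hamcount          # total msgs containing the word
    return (S * X + n * prob) / (S + n)

def max_spamprob(nspam):
    """Best case for a spam word: in every spam msg, in no ham msg."""
    return adjusted_spamprob(nspam, 0, nspam, nham=1)

for nspam in (10, 100, 1000, 10000):
    print(nspam, max_spamprob(nspam))
```

Under these assumptions the ceiling is (S*X + nspam) / (S + nspam),
which creeps toward 1.0 only as the number of trained spam grows.  The
mirror-image argument gives a floor that depends on the ham total.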

I suspect that by the time the corpora got that large, adjusting
the training ratio wouldn't make a lick of difference if the
corpora were sampled randomly to achieve the given ratio.  There
would just be too little chance of excluding a topic from the
samples.  Systematically excluding a topic might produce equivalent
results to my ratio tests.
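
A back-of-the-envelope sketch of the "too little chance of excluding a
topic" argument.  It assumes messages are drawn independently (sampling
with replacement from a large corpus) and that a topic accounts for a
fixed fraction of the mail.  The 1% figure and the function name are
invented for illustration.

```python
def prob_topic_excluded(topic_fraction, sample_size):
    """Chance that a random sample contains no message on the topic,
    assuming independent draws from a large corpus."""
    return (1.0 - topic_fraction) ** sample_size

# A topic that makes up 1% of the ham, at growing sample sizes.
for sample_size in (100, 1000, 10000):
    print(sample_size, prob_topic_excluded(0.01, sample_size))
```

With these numbers the topic is missed a bit more than a third of the
time in a sample of 100, and essentially never in a sample of 10000,
which fits the idea that the ratio matters less as the corpora grow.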

- Alex