[Spambayes] Outlook plugin - training

Thu Nov 7 22:11:22 2002

[Anthony Baxter]
> Note that "random sample" is not as trivial as all that, either - if
> you have a very high ham:spam ratio in your training DB, your accuracy
> will suffer (see the tests from Alex, myself and others).

I still need to try to make sense of those tests.  A real complication is
that more than one thing changes when trying to test ratios:  it's not just
the ratio that changes, but also the absolute number of msgs of each kind
trained on.
For example, (a) train on 5000 ham and 1000 spam; or, (b) train on 50000 ham
and 10000 spam.  The ratios are identical.  Do we expect the error rates to
be identical too?  I don't, but haven't tried it.  I expect the latter would
do better than the former, despite the identical ratios, simply because more
msgs allow better spamprob estimates.
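
A rough way to see why absolute counts matter even at a fixed ratio, assuming
(this is a modeling assumption, not something measured in the tests) that a
word's presence in a msg behaves like an independent coin flip, so the
binomial standard error applies.  The 1% word frequency is hypothetical:

```python
import math

def proportion_std_error(p, n):
    """Standard error of an observed fraction p measured over n messages.

    Assumes a simple binomial model: each message independently contains
    the word with probability p.
    """
    return math.sqrt(p * (1.0 - p) / n)

# Hypothetical word that truly appears in 1% of spam.
p = 0.01
se_small = proportion_std_error(p, 1000)    # scenario (a): 1000 spam
se_large = proportion_std_error(p, 10000)   # scenario (b): 10000 spam

print(f"1000 spam:  std error {se_small:.5f}")
print(f"10000 spam: std error {se_large:.5f}")
```

Under that model, ten times the msgs shrinks the noise in each per-word
fraction by a factor of sqrt(10), with the ham:spam ratio untouched.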

Something missing in "the ratio tests" is a rationale (even an
after-the-fact one) for believing there's some aspect of the system that's
sensitive to the ratio.  The combining method certainly is not, and the
spamprob estimation (update_probabilities()) deliberately works with
percentages instead of raw counts so that the ham:spam training ratio has
no direct effect on the spamprobs calculated.
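
A minimal sketch of the percentages-not-counts idea.  The function name
raw_spamprob and the counts are made up for illustration, and this is not the
actual body of update_probabilities(); it only shows the kind of estimate
that is blind to the training ratio:

```python
def raw_spamprob(spamcount, hamcount, nspam, nham):
    """Spamprob estimate built from per-class fractions, not raw counts.

    spamcount/hamcount: number of trained spam/ham msgs containing the word.
    nspam/nham: total number of trained spam/ham msgs.
    Assumes the word was seen at least once, so the denominator is nonzero.
    """
    spamratio = spamcount / nspam   # fraction of spam containing the word
    hamratio = hamcount / nham      # fraction of ham containing the word
    return spamratio / (spamratio + hamratio)

# The word shows up in 10% of spam and 1% of ham in both training sets;
# only the ham:spam ratio differs (1:1 versus 5:1).
balanced = raw_spamprob(spamcount=100, hamcount=10, nspam=1000, nham=1000)
skewed = raw_spamprob(spamcount=100, hamcount=50, nspam=1000, nham=5000)

print(balanced, skewed)
```

Both calls see the same two fractions, so they produce the same estimate;
working from raw counts instead would push the 5:1 case toward ham.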

> An easy example of this is those of us who are on a bunch of higher
> volume python.org lists - Greg's sterling work there means that very
> little spam gets through there.

The total # of spam training msgs does limit how high a spamprob can get,
and the total # of ham training msgs limits how low.  The *suspicion* I had
running my large c.l.py test is that it wasn't the ratio that mattered so
much as the absolute number, and that the error rates didn't "settle down"
to the 4th digit until I got near 10,000 spam total.
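
One way such limits can arise, sketched under an assumption:  suppose the raw
estimate is pulled toward a prior x with strength s, in the form
(s*x + n*p) / (s + n), where n is the number of trained msgs containing the
word.  The form, the defaults s=0.45 and x=0.5, and the function names are
assumptions for illustration, not quoted from the code:

```python
def adjusted_spamprob(spamcount, hamcount, nspam, nham, s=0.45, x=0.5):
    """Raw fraction-based spamprob, shrunk toward prior x with strength s.

    The adjustment form and the default s and x are illustrative assumptions.
    """
    spamratio = spamcount / nspam
    hamratio = hamcount / nham
    p = spamratio / (spamratio + hamratio)
    n = spamcount + hamcount          # total evidence for this word
    return (s * x + n * p) / (s + n)

def max_spamprob(nspam, nham=1000):
    """Best case for a spam clue: in every trained spam, in no trained ham."""
    return adjusted_spamprob(nspam, 0, nspam, nham)

def min_spamprob(nham, nspam=1000):
    """Best case for a ham clue: in every trained ham, in no trained spam."""
    return adjusted_spamprob(0, nham, nspam, nham)

for total in (10, 100, 10000):
    print(total, max_spamprob(total), min_spamprob(total))
```

With this form the ceiling stays strictly below 1 and creeps upward only as
the spam total grows; the floor behaves the same way with the ham total.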

> As spambayes takes over the world, this could be a larger problem.

Despite all the above <wink>, when faking "random sample" by hand in my
personal classifiers, I see I've *ended up* aiming for about an equal number
of each in my training data.  That works well too (for me, and
anecdotally -- these aren't controlled experiments).
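
For anyone wanting to automate the equal-numbers habit, a minimal sketch.
The helper name balanced_sample and the lists of msg ids are hypothetical;
nothing like this is claimed to exist in the Outlook plugin:

```python
import random

def balanced_sample(ham, spam, seed=None):
    """Pick equally many ham and spam msgs at random for training.

    ham and spam are lists of msgs (or msg ids); the smaller class sets
    the sample size.  seed is optional, for repeatable picks.
    """
    rng = random.Random(seed)
    k = min(len(ham), len(spam))
    return rng.sample(ham, k), rng.sample(spam, k)

# Hypothetical mailbox with a 10:1 ham:spam ratio.
ham_ids = [f"ham-{i}" for i in range(500)]
spam_ids = [f"spam-{i}" for i in range(50)]

train_ham, train_spam = balanced_sample(ham_ids, spam_ids, seed=42)
print(len(train_ham), len(train_spam))
```

This throws away most of the ham in a lopsided mailbox, which is the price
of keeping the two training sets the same size.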