[Spambayes] Testers needed with unbalanced spam::ham training data

Tim Peters tim.one@comcast.net
Sun Nov 17 19:38:20 2002


If you have a strong imbalance between the # of ham and # of spam in your
training data (or even if you don't but can spare the effort), please do a
before-and-after test, where the "after" run adds the new option:

[Classifier]
experimental_ham_spam_imbalance_adjustment: True

I expect this option to go away and its behavior to become the default, but
it needs testing before I'll do that.

My c.l.py test has minor imbalance, and enabling this option doesn't really
matter on it:

filename:       cv    imbal
ham:spam:  20000:14000  20000:14000
fp total:        3       3
fp %:         0.01    0.01
fn total:        0       0
fn %:         0.00    0.00
unsure t:       91      95
unsure %:     0.27    0.28
real cost:  $48.20  $49.00
best cost:  $17.80  $17.80
h mean:       0.24    0.25
h sdev:       2.73    2.79
s mean:      99.95   99.96
s sdev:       1.40    1.32
mean diff:   99.71   99.71
k:           24.14   24.26
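
The "real cost" row can be checked by hand. The sketch below assumes
per-message costs of $10 for a false positive, $1 for a false negative and
$0.20 for an unsure; those weights are not stated in this message, but they
do reproduce both figures in the table. The function name real_cost is
made up for illustration.

```python
# Assumed per-message costs (not stated in the post; chosen because they
# reproduce the "real cost" row of the table).
FP_COST = 10.00      # false positive
FN_COST = 1.00       # false negative
UNSURE_COST = 0.20   # message left in the Unsure range


def real_cost(fp, fn, unsure):
    """Total cost of a run under the assumed weights."""
    return fp * FP_COST + fn * FN_COST + unsure * UNSURE_COST


# (fp total, fn total, unsure total) for the two columns of the table.
runs = {"cv": (3, 0, 91), "imbal": (3, 0, 95)}
for name, (fp, fn, unsure) in runs.items():
    print(f"{name}: ${real_cost(fp, fn, unsure):.2f}")
# cv: $48.20
# imbal: $49.00
```

Under these weights the whole $0.80 difference between the runs comes from
the four extra Unsures.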

Since I have more ham than spam, the effect of the option is to "believe"
the hamcounts less than it used to, so that spamprobs have a harder time
getting close to 0.  That in turn makes everything a little spammier than it
used to be, so all the effects on the statistics are expected:  ham and spam
means both go up a little, ham sdev increases a little because strong ham
words aren't as strong as they were, spam sdev decreases because strong spam
words are stronger than they were, and a few edge-case hams drifted into
Unsure territory because they're judged to be a little spammier than they
were.  A *possible* effect this data doesn't suffer is an increase in FP
rate, which would again be due to everything looking a little spammier (I'm
not being accurate here!  it's really due to everything looking less hammy,
but the distinction is too subtle to belabor <wink>).  Likewise some FN may
be redeemed (but weren't in this test, since it had no FN to begin with).
All these effects will be stronger the larger the imbalance in your
ham::spam ratio.
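
To make "believe the hamcounts less" concrete, here is a minimal sketch of
one way such an adjustment can work. It assumes a Robinson-style smoothed
word probability with an unknown-word strength of 0.45 and an unknown-word
probability of 0.5, and it discounts the counts from the larger class by
the class ratio. The function name, its parameters and those defaults are
assumptions for illustration, not a copy of the checked-in code.

```python
def spamprob(hamcount, spamcount, nham, nspam, adjust=False, s=0.45, x=0.5):
    """Smoothed spam probability for a single word (illustrative sketch).

    hamcount/spamcount: number of ham/spam messages containing the word.
    nham/nspam: total number of ham/spam messages trained on.
    s, x: assumed unknown-word strength and unknown-word probability.
    """
    if hamcount == 0 and spamcount == 0:
        return x  # never-seen word: fall back to the prior
    hamratio = hamcount / nham
    spamratio = spamcount / nspam
    raw = spamratio / (hamratio + spamratio)

    if adjust:
        # Discount evidence from whichever class is over-represented.
        spam2ham = min(nspam / nham, 1.0)
        ham2spam = min(nham / nspam, 1.0)
    else:
        spam2ham = ham2spam = 1.0

    # Effective amount of evidence backing the raw estimate.
    n = hamcount * spam2ham + spamcount * ham2spam
    return (s * x + n * raw) / (s + n)


# A word seen in 5 hams and no spams, with 20000 ham and 14000 spam trained.
print(f"{spamprob(5, 0, 20000, 14000):.4f}")               # 0.0413
print(f"{spamprob(5, 0, 20000, 14000, adjust=True):.4f}")  # 0.0570
```

With more ham than spam, the ham-only word is backed by less effective
evidence once the adjustment is on, so its spamprob stays further from 0.
In this sketch a spam-only word is unchanged, because only the larger
class's counts are discounted.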


Oops!  Looks like SourceForge is down -- I haven't been able to check in the
changes yet.  Keep trying until they show up <wink>.



