[Spambayes] Testers needed with unbalanced spam::ham training data

Sun Nov 17 23:31:26 2002

[Richie Hindle, trying

  [Classifier]
  experimental_ham_spam_imbalance_adjustment: True
]

Thank you!

> Four runs, with and without
> experimental_ham_spam_imbalance_adjustment, and
> with a 10:1 ham:spam imbalance either way:
>
> lowham[_adj]:  timcv.py -n10 --ham=20  --spam=200 -s1
> lowspam[_adj]: timcv.py -n10 --ham=200 --spam=20  -s1
>
> filename:   lowham lowham_adj
> ham:spam:  200:2000 200:2000
> fp total:       15       2
> fp %:         7.50    1.00
> fn total:        1       1
> fn %:         0.05    0.05
> unsure t:       37      42
> unsure %:     1.68    1.91
> real cost: $158.40  $29.40
> best cost:  $67.20  $26.40
> h mean:      17.41    8.38
> h sdev:      31.13   20.20
> s mean:      99.90   99.66
> s sdev:       2.47    3.35
> mean diff:   82.49   91.28
> k:            2.46    3.88

So the effect of the adjustment is to make everything less spammy:  both
means decrease, ham sdev decreases, spam sdev increases, FPs get redeemed,
and FNs get more likely, though less so than Unsures do.  The sdevs stay
small enough that the bottom-line increase in k is significant, and
everything works as hoped here.
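
The k rows in these tables are consistent with dividing the difference of
the two means by the sum of the two sdevs.  A minimal sketch, assuming that
definition (the function name is made up for illustration, it isn't taken
from the test driver):

```python
def separation_k(h_mean, h_sdev, s_mean, s_sdev):
    """Difference of the score means, in units of the summed sdevs.

    Assumed definition, inferred from the table rows: a larger k means
    the ham and spam score populations overlap less.
    """
    return (s_mean - h_mean) / (h_sdev + s_sdev)

# Figures copied from the lowham / lowham_adj columns above.
lowham = separation_k(17.41, 31.13, 99.90, 2.47)
lowham_adj = separation_k(8.38, 20.20, 99.66, 3.35)
print(f"{lowham:.2f} {lowham_adj:.2f}")   # 2.46 3.88
```

So k goes up here both because the means move further apart and because
the sum of the sdevs shrinks.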

> filename:  lowspam lowspam_adj
> ham:spam:  2000:200 2000:200
> fp total:        0       1
> fp %:         0.00    0.05
> fn total:       10       1
> fn %:         5.00    0.50
> unsure t:       35      72
> unsure %:     1.59    3.27
> real cost:  $17.00  $25.40
> best cost:  $10.80   $7.00
> h mean:       0.18    1.61
> h sdev:       2.08    7.13
> s mean:      89.39   96.69
> s sdev:      23.92   10.59
> mean diff:   89.21   95.08
> k:            3.43    5.37

Now the effect is to make everything less hammy, so it's the mirror image:
both means increase, ham sdev increases, spam sdev decreases, FNs get
redeemed, and FPs get more likely, though less so than Unsures do.  So again
everything worked as hoped, and the bottom-line increase in k is again a
Good Thing.
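
The "real cost" rows are consistent with charging $10 per FP, $1 per FN and
$0.20 per Unsure.  A small sketch under that assumption (the weights are
inferred from the figures in the tables, and the function name is
hypothetical):

```python
def real_cost(fp, fn, unsure, fp_weight=10.00, fn_weight=1.00,
              unsure_weight=0.20):
    """Total cost of a run under the assumed per-mistake weights."""
    return fp * fp_weight + fn * fn_weight + unsure * unsure_weight

# Counts copied from the lowspam / lowspam_adj columns above.
lowspam = real_cost(fp=0, fn=10, unsure=35)
lowspam_adj = real_cost(fp=1, fn=1, unsure=72)
print(f"${lowspam:.2f} ${lowspam_adj:.2f}")   # $17.00 $25.40
```

Under those weights the adjusted run's real cost is higher, because the one
new FP and the extra Unsures outweigh the nine redeemed FNs at the cutoffs
used for the run.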

Great!  That's all I could have hoped for.  If you hoped for more, you were
being unrealistic <wink>.

Curious:  both before and after the adjustment, you got better results
training on far more ham than spam than you did with the reverse imbalance.
Most previous reports have been the opposite (in my own tests, I haven't
noted a reliable trend in either direction there).

> The introduced fp in lowspam_adj is a very spammy HTML email from
> an ISP - it's always shown up as an fp in my corpus.

Since the adjusted run's "best cost" was under $10, and a single FP would
cost that much by itself, it's certain that the post-run histogram analysis
found cutoffs where you would have gotten no FP.  Whether those are cutoffs
you'd be comfortable with I can't say.
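
To illustrate that reasoning, here's a brute-force sketch of a best-cost
cutoff search.  It is not the histogram code the test driver uses:  the
function name, the grid step, the $10/$1/$0.20 weights and the scores below
are all assumptions made up for the example.

```python
def best_cutoffs(ham_scores, spam_scores, fp_w=10.0, fn_w=1.0,
                 unsure_w=0.2, step=5):
    """Try every (ham_cutoff, spam_cutoff) pair on a coarse grid.

    A score below ham_cutoff is called ham, a score at or above
    spam_cutoff is called spam, and anything in between is Unsure.
    Returns (cost, ham_cutoff, spam_cutoff, fp) for the cheapest pair.
    """
    best = None
    grid = range(0, 101, step)
    for ham_cut in grid:
        for spam_cut in grid:
            if spam_cut < ham_cut:
                continue
            fp = sum(s >= spam_cut for s in ham_scores)
            fn = sum(s < ham_cut for s in spam_scores)
            unsure = sum(ham_cut <= s < spam_cut
                         for s in ham_scores + spam_scores)
            cost = fp * fp_w + fn * fn_w + unsure * unsure_w
            if best is None or cost < best[0]:
                best = (cost, ham_cut, spam_cut, fp)
    return best

# Made-up scores: one ham (97) looks very spammy, like the ISP mail.
ham = [0, 1, 2, 5, 97]
spam = [60, 90, 99, 100, 100]
cost, ham_cut, spam_cut, fp = best_cutoffs(ham, spam)
print(f"${cost:.2f} ham_cutoff={ham_cut} spam_cutoff={spam_cut} fp={fp}")
```

Any pair of cutoffs that lets even one FP through costs at least fp_w, so
whenever the minimum found is below fp_w the winning cutoffs must have
fp == 0.  In the made-up data that means pushing the spam cutoff above the
spammy ham and paying for a few more Unsures instead.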