[Spambayes] Testers needed with unbalanced spam::ham training data

Richie Hindle richie@entrian.com
Sun Nov 17 23:07:25 2002


> [Classifier]
> experimental_ham_spam_imbalance_adjustment: True

Four runs: with and without experimental_ham_spam_imbalance_adjustment, and
with a 10:1 imbalance in each direction (short of ham, then short of spam):

lowham[_adj]:  timcv.py -n10 --ham=20  --spam=200 -s1
lowspam[_adj]: timcv.py -n10 --ham=200 --spam=20  -s1

filename:   lowham lowham_adj
ham:spam:  200:2000 200:2000
fp total:       15       2
fp %:         7.50    1.00
fn total:        1       1
fn %:         0.05    0.05
unsure t:       37      42
unsure %:     1.68    1.91
real cost: $158.40  $29.40
best cost:  $67.20  $26.40
h mean:      17.41    8.38
h sdev:      31.13   20.20
s mean:      99.90   99.66
s sdev:       2.47    3.35
mean diff:   82.49   91.28
k:            2.46    3.88
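
As a sanity check on the table above, the "mean diff" and "k" rows can be
recomputed from the mean and sdev rows. This is a minimal sketch, assuming
mean diff = s mean - h mean and k = mean diff / (h sdev + s sdev); those
formulas are inferred from the printed figures, not taken from the test
driver's source, and the function name is made up for illustration.

```python
def separation(h_mean, h_sdev, s_mean, s_sdev):
    """Return (mean diff, k) for one run, rounded as in the table.

    Assumed formulas (inferred from the printed numbers):
      mean diff = s_mean - h_mean
      k         = mean diff / (h_sdev + s_sdev)
    """
    mean_diff = s_mean - h_mean
    k = mean_diff / (h_sdev + s_sdev)
    return round(mean_diff, 2), round(k, 2)


# Figures copied from the lowham / lowham_adj columns.
lowham = separation(h_mean=17.41, h_sdev=31.13, s_mean=99.90, s_sdev=2.47)
lowham_adj = separation(h_mean=8.38, h_sdev=20.20, s_mean=99.66, s_sdev=3.35)

print("lowham:    ", lowham)
print("lowham_adj:", lowham_adj)
```

A larger k means the ham and spam score distributions are further apart
relative to their spread.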

filename:  lowspam lowspam_adj
ham:spam:  2000:200 2000:200
fp total:        0       1
fp %:         0.00    0.05
fn total:       10       1
fn %:         5.00    0.50
unsure t:       35      72
unsure %:     1.59    3.27
real cost:  $17.00  $25.40
best cost:  $10.80   $7.00
h mean:       0.18    1.61
h sdev:       2.08    7.13
s mean:      89.39   96.69
s sdev:      23.92   10.59
mean diff:   89.21   95.08
k:            3.43    5.37
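
The "real cost" rows in both tables are consistent with a simple weighted
sum of the error counts. The sketch below assumes weights of $10.00 per fp,
$1.00 per fn and $0.20 per unsure; these weights are an assumption chosen
because they reproduce the printed figures, and the names used are
illustrative only.

```python
# Assumed per-message costs (they reproduce the "real cost" rows above).
FP_COST = 10.00      # false positive: ham classified as spam
FN_COST = 1.00       # false negative: spam classified as ham
UNSURE_COST = 0.20   # message left in the unsure range


def real_cost(fp, fn, unsure):
    """Weighted sum of the error counts, in dollars."""
    return round(fp * FP_COST + fn * FN_COST + unsure * UNSURE_COST, 2)


# (fp total, fn total, unsure total) copied from the four columns.
runs = {
    "lowham": (15, 1, 37),
    "lowham_adj": (2, 1, 42),
    "lowspam": (0, 10, 35),
    "lowspam_adj": (1, 1, 72),
}

costs = {name: real_cost(*counts) for name, counts in runs.items()}

for name, cost in costs.items():
    print(f"{name:12s} ${cost:.2f}")
```

Under these weights, the higher real cost of lowspam_adj comes from the one
extra fp and the doubled unsure count, which outweigh the nine fewer fns.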

The fp introduced in lowspam_adj is a very spammy HTML email from an ISP;
it has always shown up as an fp in my corpus.

-- 
Richie Hindle
richie@entrian.com



