[spambayes-dev] Another incremental training idea...

Toby Dickenson tdickenson at geminidataloggers.com
Thu Jan 15 06:48:26 EST 2004


On Thursday 15 January 2004 03:05, Tim Peters wrote:
> [Skip Montanaro]
>
> > ...
> > It does seem a bit arbitrary, but the system seems to suggest
> > we need to be slaves to balance and that's one way to get it.
>
> Cross validation testing is measuring random-time-order TOE performance,
> and we know imbalance hurts that.

I've finally got the cross validation tools working here, and the first thing I 
looked at was imbalance. My normal training set is currently 14k hams and 2k 
spams. This test compared that imbalanced set against three independently 
selected balanced sets with 2k of each.

If I'm reading this right, my 7:1 imbalance doesn't hurt me.

filename:    unbal    bal1    bal2    bal3
ham:spam:  14560:1992  1992:1992  1992:1992  1992:1992
fp total:        0       0       1       0
fp %:         0.00    0.00    0.05    0.00
fn total:       12       6       8       6
fn %:         0.60    0.30    0.40    0.30
unsure t:      102      21      23      29
unsure %:     0.62    0.53    0.58    0.73
real cost:  $32.40  $10.20  $22.60  $11.80
best cost:  $27.60   $7.00   $9.80   $8.60
h mean:       0.11    0.23    0.30    0.32
h sdev:       1.89    2.47    3.46    3.26
s mean:      96.93   99.06   99.04   99.02
s sdev:      12.11    6.88    6.98    7.21
mean diff:   96.82   98.83   98.74   98.70
k:            6.92   10.57    9.46    9.43
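
For anyone who wants to check how the derived rows relate to the raw counts, 
here is a small Python sketch. The cost weights ($10 per fp, $1 per fn, $0.20 
per unsure) and the formula k = mean diff / (h sdev + s sdev) are assumptions 
on my part, not something taken from the tools' source; they do reproduce the 
"real cost" and "k" rows above to within rounding. RUNS, summarize and SUMMARY 
are names made up for this illustration.

```python
# Raw counts and score statistics copied from the table above.
RUNS = {
    "unbal": dict(ham=14560, spam=1992, fp=0, fn=12, unsure=102,
                  h_mean=0.11, h_sdev=1.89, s_mean=96.93, s_sdev=12.11),
    "bal1": dict(ham=1992, spam=1992, fp=0, fn=6, unsure=21,
                 h_mean=0.23, h_sdev=2.47, s_mean=99.06, s_sdev=6.88),
    "bal2": dict(ham=1992, spam=1992, fp=1, fn=8, unsure=23,
                 h_mean=0.30, h_sdev=3.46, s_mean=99.04, s_sdev=6.98),
    "bal3": dict(ham=1992, spam=1992, fp=0, fn=6, unsure=29,
                 h_mean=0.32, h_sdev=3.26, s_mean=99.02, s_sdev=7.21),
}

# ASSUMED cost weights in dollars per message (not stated in this post).
FP_COST, FN_COST, UNSURE_COST = 10.00, 1.00, 0.20


def summarize(run):
    """Recompute the derived rows of the table from one run's raw numbers."""
    total = run["ham"] + run["spam"]
    return {
        # False positives are hams scored as spam, so divide by the ham count.
        "fp_pct": 100.0 * run["fp"] / run["ham"],
        # False negatives are spams scored as ham, so divide by the spam count.
        "fn_pct": 100.0 * run["fn"] / run["spam"],
        # Unsures can be either kind, so divide by all messages.
        "unsure_pct": 100.0 * run["unsure"] / total,
        "real_cost": (run["fp"] * FP_COST
                      + run["fn"] * FN_COST
                      + run["unsure"] * UNSURE_COST),
        # ASSUMED separation measure: gap between means over summed sdevs.
        "k": (run["s_mean"] - run["h_mean"]) / (run["h_sdev"] + run["s_sdev"]),
    }


SUMMARY = {name: summarize(run) for name, run in RUNS.items()}

if __name__ == "__main__":
    for name, s in SUMMARY.items():
        print("%-6s fp%%=%.2f fn%%=%.2f unsure%%=%.2f cost=$%.2f k=%.2f" % (
            name, s["fp_pct"], s["fn_pct"], s["unsure_pct"],
            s["real_cost"], s["k"]))
```

Note that under these weights most of the unbalanced run's extra cost comes 
from its 102 unsures, which are spread over about four times as many messages 
as in the balanced runs.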

-- 
Toby Dickenson



