[Spambayes] Improved comparison of classifier changes?

T. Alexander Popiel popiel at wolfskeep.com
Fri Mar 7 10:29:04 EST 2003


In message:  <9891913C5BFE87429D71E37F08210CB9297597 at zeus.sfhq.friskit.com>
             "Piers Haken" <piersh at friskit.com> writes:
>(This came to me in a dream. No, really...)
>
>When comparing two different classifier/tokenizer strategies, instead of
>just comparing the numbers of false negatives and positives, how about
>comparing some function (product, sum, average,
>some-more-appropriate-statistical-function?) of the spam probability of
>all messages in each classification (spam, ham, false-positive,
>false-negative)? This might give a slightly better indication of not
>just the numbers of messages that were classified correctly/incorrectly,
>but of how sure the classifier was when it made those decisions.
>
>.. or was I just dreaming...?
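
Concretely, I read the suggestion as something like this sketch (the
bucket names and the assumption that each message carries a score in
[0.0, 1.0] are mine, not actual spambayes code):

    from statistics import mean, stdev

    # scored: list of (score, bucket) pairs, where bucket is one of
    # 'ham', 'spam', 'fp', 'fn' and score is the spam probability.
    def bucket_stats(scored):
        buckets = {}
        for score, bucket in scored:
            buckets.setdefault(bucket, []).append(score)
        # Report not just how many messages landed in each bucket,
        # but how confident the classifier was about them on average.
        return {name: (mean(scores),
                       stdev(scores) if len(scores) > 1 else 0.0)
                for name, scores in buckets.items()}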

Here's sample output from table.py:

filename:      rcb     rcB     rCb     rCB     Rcb     RcB     RCb     RCB
ham:spam:  2000:2000 2000:2000 2000:2000 2000:2000 2000:2000 2000:2000 2000:2000 2000:2000
fp total:        3       3       3       3       3       3       3       3
fp %:         0.15    0.15    0.15    0.15    0.15    0.15    0.15    0.15
fn total:       12      14      16      14      12      12      12      12
fn %:         0.60    0.70    0.80    0.70    0.60    0.60    0.60    0.60
unsure t:       53      37      50      39      40      31      37      32
unsure %:     1.32    0.93    1.25    0.97    1.00    0.78    0.93    0.80
real cost:  $52.60  $51.40  $56.00  $51.80  $50.00  $48.20  $49.40  $48.40
best cost:  $48.20  $45.20  $49.20  $45.60  $37.20  $38.80  $40.60  $38.60
h mean:       0.40    0.32    0.35    0.32    0.31    0.30    0.29    0.29
h sdev:       5.39    4.71    5.12    4.68    4.55    4.47    4.47    4.43
s mean:      98.45   98.68   98.35   98.68   98.75   98.85   98.72   98.85
s sdev:       9.76    9.57   10.46    9.58    9.08    9.06    9.37    9.11
mean diff:   98.05   98.36   98.00   98.36   98.44   98.55   98.43   98.56
k:            6.47    6.89    6.29    6.90    7.22    7.28    7.11    7.28

So yes, when using the test harness and associated tools, we do
compare more than just the fp and fn counts.  We also look at
percentages, a weighted cost function, the best possible cost
achievable just by moving the ham and spam cutoffs, and the
mean scores, their separation, and their standard deviations.
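
For concreteness, the bottom rows of the table reduce to arithmetic
like the following (a simplified sketch, not table.py itself; the
$10/$1/$0.20 weights are the harness's standard fp/fn/unsure costs,
which you can check against the numbers above: 3*10 + 12*1 + 53*0.20
= $52.60, the first "real cost" entry):

    from statistics import mean, stdev

    def summarize(ham_scores, spam_scores, n_fp, n_fn, n_unsure):
        # Weighted cost: $10 per false positive, $1 per false
        # negative, $0.20 per unsure message.
        cost = 10.0 * n_fp + 1.0 * n_fn + 0.20 * n_unsure
        h_mean, h_sdev = mean(ham_scores), stdev(ham_scores)
        s_mean, s_sdev = mean(spam_scores), stdev(spam_scores)
        mean_diff = s_mean - h_mean
        # k: separation of the ham and spam score populations in
        # units of combined spread, e.g. 98.05 / (5.39 + 9.76) = 6.47
        # in the first column above.
        k = mean_diff / (h_sdev + s_sdev)
        return cost, h_mean, h_sdev, s_mean, s_sdev, mean_diff, k

The "best cost" row does the same cost arithmetic while sweeping the
ham and spam cutoffs over the observed scores and keeping the
minimum, roughly:

    def best_cost(ham_scores, spam_scores):
        candidates = sorted(set(ham_scores) | set(spam_scores))
        best = None
        for i, ham_cut in enumerate(candidates):
            for spam_cut in candidates[i:]:
                # Ham at or above the spam cutoff is a false
                # positive; spam below the ham cutoff is a false
                # negative; anything in between is unsure.
                fp = sum(1 for s in ham_scores if s >= spam_cut)
                fn = sum(1 for s in spam_scores if s < ham_cut)
                unsure = sum(1 for s in ham_scores + spam_scores
                             if ham_cut <= s < spam_cut)
                cost = 10.0 * fp + 1.0 * fn + 0.20 * unsure
                if best is None or cost < best:
                    best = cost
        return best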

We just haven't done much tokenizer testing lately, so these
reports aren't obvious in the recent archives.

- Alex


