[spambayes-dev] testing tweaks

Meyer, Tony T.A.Meyer at massey.ac.nz
Mon Aug 11 16:50:39 EDT 2003


[Tim explains his & Gary's uni/bigram idea]
> I only had time to run a few tests on that, and it looked very 
> promising,

You're right about that!  I had a little play around with this idea over
the weekend and it certainly improves the results.

I was lazy, so I did this the easiest way (well, what seemed the easiest
way), producing *token* bigrams rather than *word* bigrams.  This means
that "This is a test" produces "This", "test" and "This test", since
"is" and "a" don't generate tokens.  It also means that our synthetic
tokens become part of bigrams (so a bigram could be skip information,
headers, and so on).  Whether it's better or worse than word bigrams, I
don't know (that's what testing is for!).  Also as a result of the
laziness, I left in the circular bigram created from the last token and
the first token; since the first token is likely to be fairly constant,
I doubt this makes much difference.

Here are preliminary results [using "timtest.py -n5"].

The two columns with "fresh" in the filename are results with a
fresh-from-cvs spambayes.  The "tim1s" column shows results where I
mistakenly allowed duplicate tokens to be generated (if a token had a
stronger difference than both the bigram with the previous token and the
bigram with the next token, it was used twice).

The "tim2" columns are with this mistake removed, and the "tim3" columns
are like tim2, but also with Kenny's variant of Sean's
split_compound_words idea enabled.
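The "use whichever is stronger" selection between a unigram and its
overlapping bigrams can be sketched like this (a hypothetical helper,
not the actual SpamBayes code; "strength" here is just distance from
the neutral 0.5 spamprob):

```python
def strength(prob):
    """How decisive a spamprob is: its distance from the neutral 0.5."""
    return abs(prob - 0.5)

def pick_strongest(candidates, probs):
    """From overlapping unigram/bigram candidates, keep the single token
    whose spamprob is farthest from 0.5.  Unknown tokens score 0.5."""
    return max(candidates, key=lambda tok: strength(probs.get(tok, 0.5)))

# Illustrative spamprobs (made up for the example):
probs = {"free": 0.93, "free money": 0.99}
print(pick_strongest(["free", "free money"], probs))
# 'free money'
```

The tim1s "mistake" amounts to sometimes counting the same unigram on
both sides of this choice; tim2 picks each candidate only once.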

filename:   sa_freshs   sa_tim2s   sa_tim3s     freshs      tim1s      tim2s      tim3s
ham:spam:   7580:7580  7580:7580  7580:7580 7900:15260 7900:15260 7900:15260 7900:15260
fp total:       44      47      47       2       2       2       2
fp %:         0.58    0.62    0.62    0.03    0.03    0.03    0.03
fn total:       16      12      13     176      94     128     127
fn %:         0.21    0.16    0.17    1.15    0.62    0.84    0.83
unsure t:      356     315     320     501     497     482     500
unsure %:     2.35    2.08    2.11    2.16    2.15    2.08    2.16
real cost: $527.20 $545.00 $547.00 $296.20 $213.40 $244.40 $247.00
best cost: $592.40 $843.20 $825.20 $489.60 $379.20 $402.20 $416.40
h mean:       3.40    4.07    4.07    0.63    1.19    0.92    0.94
h sdev:      14.19   15.55   15.49    4.84    7.05    5.98    6.09
s mean:      97.94   98.76   98.74   94.52   96.23   96.02   95.99
s sdev:       9.43    7.80    7.88   18.67   14.79   15.54   15.64
mean diff:   94.54   94.69   94.67   93.89   95.04   95.10   95.05
k:            4.00    4.06    4.05    3.99    4.35    4.42    4.37
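For reference, the "real cost" rows follow the standard SpamBayes cost
accounting ($10 per false positive, $1 per false negative, $0.20 per
unsure), which you can verify against the table:

```python
def real_cost(fp, fn, unsure, fp_cost=10.0, fn_cost=1.0, unsure_cost=0.2):
    """SpamBayes-style cost: $10 per false positive, $1 per false
    negative, $0.20 per unsure message."""
    return fp * fp_cost + fn * fn_cost + unsure * unsure_cost

# The sa_freshs column above: 44 fps, 16 fns, 356 unsures.
print("$%.2f" % real_cost(44, 16, 356))  # $527.20
# The tim1s column: 2 fps, 94 fns, 497 unsures.
print("$%.2f" % real_cost(2, 94, 497))   # $213.40
```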

So a *big* win on the second set (which is from my actual mail; the
other corpus is based on the SpamAssassin public corpus) in terms of
fn's.  In fact the mistake variant did best - almost halving the number
of fn's.  Not sure about the first set - 3 more fp's, but 3 fewer fn's
and quite a drop in the number of unsures.  I care more about the second
set, anyway.

My (bsddb based) databases ballooned from about 1.5MB to about 10MB, but
what do I care?

Although the second set was all from my actual mail, the training set I
use is much smaller - about 400 ham and 4000 spam (a crazy imbalance,
but it works...).  These results are from this smaller set, using
"timtest.py -n3", first without the adjustment, and then with.

filename:     reals  real_tims  real_adjs  real_tim_adjs
ham:spam:  754:8884   754:8884   754:8884       754:8884
fp total:        0       0       0       0
fp %:         0.00    0.00    0.00    0.00
fn total:      193      72     583     455
fn %:         2.17    0.81    6.56    5.12
unsure t:      638     470     541     438
unsure %:     6.62    4.88    5.61    4.54
real cost: $320.60 $166.00 $691.20 $542.60
best cost: $316.00 $197.20 $435.40 $391.60
h mean:       2.88    5.94    1.09    1.39
h sdev:      11.20   16.56    5.79    6.87
s mean:      92.54   95.98   84.92   87.91
s sdev:      21.09   14.86   32.43   29.64
mean diff:   89.66   90.04   83.83   86.52
k:            2.78    2.87    2.19    2.37

Again, a clear win for me (although the ham mean does jump up quite a
bit).

=Tony Meyer
