[Spambayes] Something to test

Sjoerd Mullender sjoerd@acm.org
Tue Nov 5 10:42:42 2002


On Sun, Nov 3 2002 Tim Peters wrote:

> Index: tokenizer.py
> ===================================================================
> RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
> retrieving revision 1.60
> diff -c -r1.60 tokenizer.py
> *** tokenizer.py        1 Nov 2002 16:10:13 -0000       1.60
> --- tokenizer.py        3 Nov 2002 08:31:44 -0000
> ***************
> *** 1178,1183 ****
> --- 1178,1185 ----
>                       x2n[x] = x2n.get(x, 0) + 1
>           for x in x2n.items():
>               yield "header:%s:%d" % x
> +         for x in options.safe_headers - Set([k.lower() for k in x2n]):
> +             yield "noheader:" + x
> 
>       def tokenize_body(self, msg, maxword=options.skip_max_word_size):
>           """Generate a stream of tokens from an email Message.

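For anyone who doesn't want to read the diff, here is a rough standalone sketch of what the two added lines do, in current Python rather than the 2.2-era `Set`. The function name `header_tokens` and the example header set are made up for illustration; the real code lives inside the tokenizer and reads `options.safe_headers`.

```python
from collections import Counter


def header_tokens(header_names, safe_headers):
    """Yield header-count tokens, then a token for each safe header that is absent.

    header_names: header field names as they appear in the message.
    safe_headers: a set of lowercase names (hypothetical stand-in for
                  options.safe_headers in the patch).
    """
    # Count how often each header name occurs, like x2n in the patch.
    counts = Counter(header_names)
    for name, n in counts.items():
        yield "header:%s:%d" % (name, n)

    # The patched part: safe headers that never showed up get a "noheader:" token.
    present = {name.lower() for name in counts}
    # sorted() is only here to make the output order predictable.
    for name in sorted(safe_headers - present):
        yield "noheader:" + name


if __name__ == "__main__":
    for token in header_tokens(["Received", "Received", "From"],
                               {"from", "to", "subject"}):
        print(token)
```

The point of the patch is that a missing header becomes a clue in its own right, instead of only the headers that are present being counted.
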
Here are my results:

filename:     cv1s    cv2s
ham:spam:  11850:3360  11850:3360
fp total:        3       3
fp %:         0.03    0.03
fn total:        4       4
fn %:         0.12    0.12
unsure t:      103     100
unsure %:     0.68    0.66
real cost:  $54.60  $54.00
best cost:  $26.60  $25.80
h mean:       0.20    0.19
h sdev:       3.15    3.15
s mean:      99.29   99.28
s sdev:       5.94    5.95
mean diff:   99.09   99.09
k:           10.90   10.89

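The only difference between the two runs: 3 messages that were unsure
in the first run got nailed correctly in the second (unsure total drops
from 103 to 100, fp and fn are unchanged), so it's a marginal improvement.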
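
As a sanity check on the "real cost" row, here is a small sketch of the arithmetic. The weights ($10 per fp, $1 per fn, $0.20 per unsure) are an assumption on my part; they do reproduce both figures in the table. The function name `real_cost` is just for illustration.

```python
def real_cost(fp, fn, unsure, fp_weight=10.0, fn_weight=1.0, unsure_weight=0.20):
    """Total cost in dollars, assuming the per-message weights given above."""
    return fp * fp_weight + fn * fn_weight + unsure * unsure_weight


if __name__ == "__main__":
    # Counts taken from the table: (fp total, fn total, unsure t).
    for label, counts in (("cv1s", (3, 4, 103)), ("cv2s", (3, 4, 100))):
        print("%s: $%.2f" % (label, real_cost(*counts)))
```

With these weights, the 3 fewer unsure messages account for the whole $0.60 difference between the runs.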

-- Sjoerd Mullender <sjoerd@acm.org>