[Spambayes] Something to test
Sjoerd Mullender
sjoerd@acm.org
Tue Nov 5 10:42:42 2002
On Sun, Nov 3 2002 Tim Peters wrote:
> Index: tokenizer.py
> ===================================================================
> RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
> retrieving revision 1.60
> diff -c -r1.60 tokenizer.py
> *** tokenizer.py 1 Nov 2002 16:10:13 -0000 1.60
> --- tokenizer.py 3 Nov 2002 08:31:44 -0000
> ***************
> *** 1178,1183 ****
> --- 1178,1185 ----
>           x2n[x] = x2n.get(x, 0) + 1
>       for x in x2n.items():
>           yield "header:%s:%d" % x
> +     for x in options.safe_headers - Set([k.lower() for k in x2n]):
> +         yield "noheader:" + x
>
>   def tokenize_body(self, msg, maxword=options.skip_max_word_size):
>       """Generate a stream of tokens from an email Message.
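
For readers skimming the patch, here is a rough standalone sketch of what the two added lines do, written for current Python rather than the 2002 code base. The function name `header_presence_tokens` and the sample header sets are made up for illustration; only the `header:` / `noheader:` token shapes and the set difference come from the quoted diff. The `sorted()` call is an addition for repeatable output and is not in the patch.

```python
def header_presence_tokens(header_names, safe_headers):
    """Yield count tokens for headers that are present, then a
    'noheader:' token for each safe header that is absent.

    header_names: header field names as they appear in the message
                  (duplicates allowed, e.g. several Received lines).
    safe_headers: set of lowercase header names (hypothetical here;
                  the patch reads it from options.safe_headers).
    """
    # Count how many times each header name occurs.
    x2n = {}
    for name in header_names:
        x2n[name] = x2n.get(name, 0) + 1
    for item in x2n.items():
        yield "header:%s:%d" % item

    # New in the patch: one token per safe header the message lacks.
    present = {k.lower() for k in x2n}
    # sorted() is only for deterministic output; the patch iterates the set.
    for name in sorted(safe_headers - present):
        yield "noheader:" + name


tokens = list(header_presence_tokens(
    ["Received", "Received", "Subject"],
    {"subject", "date", "message-id"},
))
```

With the sample input above, the message has no Date or Message-ID header, so the generator emits `noheader:date` and `noheader:message-id` after the two `header:` count tokens.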
Here are my results:
filename:         cv1s        cv2s
ham:spam:   11850:3360  11850:3360
fp total:            3           3
fp %:             0.03        0.03
fn total:            4           4
fn %:             0.12        0.12
unsure t:          103         100
unsure %:         0.68        0.66
real cost:      $54.60      $54.00
best cost:      $26.60      $25.80
h mean:           0.20        0.19
h sdev:           3.15        3.15
s mean:          99.29       99.28
s sdev:           5.94        5.95
mean diff:       99.09       99.09
k:               10.90       10.89
The difference between the two runs: 3 unsure messages got nailed
correctly, so it's a marginal improvement.
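
As a sanity check on the numbers above, the percentage and "real cost" rows can be re-derived from the raw counts. This is only a sketch: the weights of $10.00 per false positive, $1.00 per false negative and $0.20 per unsure message are an assumption not stated in this message, though they do reproduce both reported cost figures. The function name `summarize` is made up for illustration.

```python
# Assumed cost weights, in cents to avoid floating-point drift.
FP_CENTS, FN_CENTS, UNSURE_CENTS = 1000, 100, 20

def summarize(n_ham, n_spam, fp, fn, unsure):
    """Re-derive the percentage and real-cost rows from raw counts."""
    cost_cents = fp * FP_CENTS + fn * FN_CENTS + unsure * UNSURE_CENTS
    return {
        "fp %": round(100.0 * fp / n_ham, 2),        # share of ham misfiled
        "fn %": round(100.0 * fn / n_spam, 2),       # share of spam missed
        "unsure %": round(100.0 * unsure / (n_ham + n_spam), 2),
        "real cost": cost_cents / 100.0,
    }

cv1s = summarize(11850, 3360, fp=3, fn=4, unsure=103)
cv2s = summarize(11850, 3360, fp=3, fn=4, unsure=100)
print(cv1s)
print(cv2s)
```

Under those weights, the whole $0.60 difference in real cost comes from the three messages that left the unsure range.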
-- Sjoerd Mullender <sjoerd@acm.org>