[Spambayes] A couple of small tokenizer experiments.
Anthony Baxter
anthony@interlink.com.au
Thu Nov 14 02:25:16 2002
>>> Tim Peters wrote
> Something that helped: it now generates log-count "no real name" metatokens
> too for address headers without real-name parts.
>
> 'from:no real name:2**0' 0.933186
I saw
'from:no real name:2**0' 0.683287
'reply-to:no real name:2**0' 0.873138
in the horror corpus.
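
For anyone following along, here is a rough sketch of how a log-count "no real name" metatoken of this shape could be generated. It is not the actual tokenizer code: the function name no_real_name_tokens and the use of email.utils.getaddresses are my assumptions for illustration, and only the token format is taken from the output above.

```python
import math
from email.utils import getaddresses


def no_real_name_tokens(field, header_values):
    """Yield one log-count metatoken if any address lacks a real-name part.

    Hypothetical helper: `field` is a lowercased header name such as
    "from", and `header_values` is a list of raw header strings.
    """
    noname_count = 0
    for name, addr in getaddresses(header_values):
        # An address with an empty display name counts as "no real name".
        if addr and not name:
            noname_count += 1
    if noname_count:
        # Bucket the count by its rounded base-2 logarithm, so a single
        # nameless address gives 2**0, two give 2**1, and so on.
        yield "%s:no real name:2**%d" % (field, round(math.log2(noname_count)))
```

A header with a single bare address yields 'from:no real name:2**0', the same shape as the tokens quoted above; a header whose address carries a real name yields nothing.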
> Yup, it's a small win. I can't use it in my c.l.py test, but should be able to
> on the general python.org corpus (plus, of course, my own email).
On the nasty corpus,
filename:    shout_from  shout_fromccetc
ham:spam:     5000:2500        5000:2500
fp total:            10                7
fp %:              0.20             0.14
fn total:             5                5
fn %:              0.20             0.20
unsure t:           297              257
unsure %:          3.96             3.43
real cost:      $164.40          $126.40
best cost:       $99.80           $76.60
h mean:            4.12             3.53
h sdev:           12.63            11.53
s mean:           99.49            99.47
s sdev:            5.33             5.46
mean diff:        95.37            95.94
k:                 5.31             5.65
Best cost goes from:
-> best cost for all runs: $99.80
-> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
-> achieved at ham & spam cutoffs 0.71 & 0.99
-> fp 7; fn 15; unsure ham 37; unsure spam 37
-> fp rate 0.14%; fn rate 0.6%; unsure rate 0.987%
to:
-> best cost for all runs: $76.60
-> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
-> achieved at 2 cutoff pairs
-> smallest ham & spam cutoffs 0.69 & 0.99
-> fp 5; fn 14; unsure ham 29; unsure spam 34
-> fp rate 0.1%; fn rate 0.56%; unsure rate 0.84%
-> largest ham & spam cutoffs 0.7 & 0.99
-> fp 5; fn 14; unsure ham 29; unsure spam 34
-> fp rate 0.1%; fn rate 0.56%; unsure rate 0.84%
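
The cost figures follow directly from the per-message costs printed above. A minimal sketch of the arithmetic, with total_cost as a hypothetical helper name (the counts and costs are the ones in the output; nothing here is the test driver's own code):

```python
def total_cost(fp, fn, unsure, fp_cost=10.00, fn_cost=1.00, unsure_cost=0.20):
    """Total cost of a run: each fp, fn and unsure message has a fixed price."""
    return fp * fp_cost + fn * fn_cost + unsure * unsure_cost


# Best-cost runs: unsure is unsure ham plus unsure spam.
before = total_cost(fp=7, fn=15, unsure=37 + 37)
after = total_cost(fp=5, fn=14, unsure=29 + 34)

# "Real cost" rows from the comparison table, using the table's own counts.
real_before = total_cost(fp=10, fn=5, unsure=297)
real_after = total_cost(fp=7, fn=5, unsure=257)

print("best cost: $%.2f -> $%.2f" % (before, after))
print("real cost: $%.2f -> $%.2f" % (real_before, real_after))
```

With fp priced at ten times fn, most of the improvement in both rows comes from the drop in false positives; the smaller unsure count accounts for the rest.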