[Spambayes] A couple of small tokenizer experiments.

Thu Nov 14 02:25:16 2002

>>> Tim Peters wrote
> Something that helped:  it now generates log-count "no real name" metatokens
> too for address headers without real-name parts.
> 
>         'from:no real name:2**0' 0.933186

I saw
        'from:no real name:2**0' 0.683287
        'reply-to:no real name:2**0' 0.873138

in the horror corpus.
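
For readers who have not seen these metatokens, here is a minimal sketch of how a log-count "no real name" token could be produced. It is an illustration only, not the actual tokenizer code: the function name `no_real_name_tokens` is hypothetical, and the assumption that the count is bucketed by a rounded base-2 log is inferred from the `2**0` suffix in the tokens above.

```python
import math
from email.utils import getaddresses


def no_real_name_tokens(field, header_values):
    """Yield one metatoken if any address in the header lacks a real name.

    field         -- lowercase header name, e.g. "from" or "reply-to"
    header_values -- list of raw header strings for that field
    (Hypothetical helper; the bucketing rule is an assumption.)
    """
    noname_count = 0
    for name, addr in getaddresses(header_values):
        # getaddresses() returns (realname, address) pairs; an empty
        # realname means the header carried only a bare address.
        if addr and not name:
            noname_count += 1
    if noname_count:
        # Bucket the count by its rounded base-2 logarithm, so
        # 1 -> 2**0, 2 -> 2**1, 3..5 -> 2**2, and so on.
        bucket = round(math.log2(noname_count))
        yield "%s:no real name:2**%d" % (field, bucket)


bare = list(no_real_name_tokens("from", ["someone@example.com"]))
named = list(no_real_name_tokens("from", ["Some One <someone@example.com>"]))
three = list(no_real_name_tokens(
    "reply-to", ["a@example.com, b@example.com, c@example.com"]))
```

With a single bare address this yields 'from:no real name:2**0', the same shape as the tokens quoted above; a header with a real-name part yields nothing.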

> Yup, it's a small win.  I can't use it in my c.l.py test, but should be able to
> on the general python.org corpus (plus, of course, my own email).

On the nasty corpus,

filename:   shout_from  shout_fromccetc
ham:spam:    5000:2500        5000:2500
fp total:           10                7
fp %:             0.20             0.14
fn total:            5                5
fn %:             0.20             0.20
unsure t:          297              257
unsure %:         3.96             3.43
real cost:     $164.40          $126.40
best cost:      $99.80           $76.60
h mean:           4.12             3.53
h sdev:          12.63            11.53
s mean:          99.49            99.47
s sdev:           5.33             5.46
mean diff:       95.37            95.94
k:                5.31             5.65

Goes from:
-> best cost for all runs: $99.80
-> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
-> achieved at ham & spam cutoffs 0.71 & 0.99
->     fp 7; fn 15; unsure ham 37; unsure spam 37
->     fp rate 0.14%; fn rate 0.6%; unsure rate 0.987%

to:
-> best cost for all runs: $76.60
-> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
-> achieved at 2 cutoff pairs
-> smallest ham & spam cutoffs 0.69 & 0.99
->     fp 5; fn 14; unsure ham 29; unsure spam 34
->     fp rate 0.1%; fn rate 0.56%; unsure rate 0.84%
-> largest ham & spam cutoffs 0.7 & 0.99
->     fp 5; fn 14; unsure ham 29; unsure spam 34
->     fp rate 0.1%; fn rate 0.56%; unsure rate 0.84%
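
The cost figures follow directly from the per-error prices printed above. As a quick check, here is a small sketch that recomputes them from the reported counts. The function name `total_cost` is mine, not part of the test driver, and I'm assuming the cost is a plain weighted sum, which the numbers bear out.

```python
def total_cost(fp, fn, unsure, fp_cost=10.00, fn_cost=1.00, unsure_cost=0.20):
    """Weighted sum of errors, using the per-fp/fn/unsure prices shown above.

    Hypothetical helper for checking the reported figures by hand.
    """
    return round(fp * fp_cost + fn * fn_cost + unsure * unsure_cost, 2)


# "real cost" row: counts at the cutoffs actually used in each run.
real_before = total_cost(fp=10, fn=5, unsure=297)
real_after = total_cost(fp=7, fn=5, unsure=257)

# "best cost" lines: counts at the best cutoffs found;
# unsure is unsure ham plus unsure spam.
best_before = total_cost(fp=7, fn=15, unsure=37 + 37)
best_after = total_cost(fp=5, fn=14, unsure=29 + 34)

print("real cost: $%.2f -> $%.2f" % (real_before, real_after))
print("best cost: $%.2f -> $%.2f" % (best_before, best_after))
```

This reproduces $164.40 and $126.40 for the real cost, and $99.80 and $76.60 for the best cost.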