[Spambayes] 'sender' and 'reply-to' tokenising.

Fri Nov 1 07:54:47 2002

The comments in tokenizer.py currently say:

        # Dang -- I can't use Sender:.  If I do,
        #     'sender:email name:python-list-admin'
        # becomes the most powerful indicator in the whole database.
        #
        # From:         # this helps both rates
        # Reply-To:     # my error rates are too low now to tell about this
        #               # one (smalls wins & losses across runs, overall
        #               # not significant), so leaving it out

Now that we have things like ham/spam mean and sdev, we get more useful
data. I tried enabling tokenization of 'sender', of 'reply-to', and of both
together, in each case along with the 'from' line. The left-hand column is
the default ('from' only).

filename:         from  from+sender  from+replyto  from+sender+replyto
ham:spam:   11192:1826   11192:1826    11192:1826           11192:1826
fp total:            7            6             7                    6
fp %:             0.06         0.05          0.06                 0.05
fn total:            5            4             5                    4
fn %:             0.27         0.22          0.27                 0.22
unsure t:           80           82            80                   81
unsure %:         0.61         0.63          0.61                 0.62
real cost:      $91.00       $80.40        $91.00               $80.20
best cost:      $28.00       $27.20        $28.20               $25.80
h mean:           0.62         1.32          0.63                 1.11
h sdev:           4.27         4.42          4.19                 4.19
s mean:          98.69        98.66         98.68                98.65
s sdev:           7.69         7.86          7.74                 7.92
mean diff:       98.07        97.34         98.05                97.54
k:                8.20         7.93          8.22                 8.05
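
For anyone checking the arithmetic: the 'real cost' and 'k' rows can be
reproduced from the other rows. This is only a sketch. It assumes costs of
$10 per fp, $1 per fn and $0.20 per unsure, and that k is the mean diff
divided by the sum of the two sdevs. Both assumptions match the figures in
the table, but the function names are made up for illustration and are not
taken from the test harness.

```python
# Assumed weights; they reproduce the "real cost" row of the table.
FP_COST = 10.00
FN_COST = 1.00
UNSURE_COST = 0.20


def real_cost(fp, fn, unsure):
    """Weighted cost of one run, in dollars."""
    return fp * FP_COST + fn * FN_COST + unsure * UNSURE_COST


def separation_k(h_mean, h_sdev, s_mean, s_sdev):
    """Gap between spam and ham means, in units of the summed sdevs."""
    return (s_mean - h_mean) / (h_sdev + s_sdev)


# Per run: (fp, fn, unsure, h mean, h sdev, s mean, s sdev)
runs = {
    "from": (7, 5, 80, 0.62, 4.27, 98.69, 7.69),
    "from+sender": (6, 4, 82, 1.32, 4.42, 98.66, 7.86),
    "from+replyto": (7, 5, 80, 0.63, 4.19, 98.68, 7.74),
    "from+sender+replyto": (6, 4, 81, 1.11, 4.19, 98.65, 7.92),
}

for name, (fp, fn, unsure, hm, hs, sm, ss) in runs.items():
    cost = real_cost(fp, fn, unsure)
    k = separation_k(hm, hs, sm, ss)
    print("%-20s cost $%.2f  k %.2f" % (name, cost, k))
```

Seen this way, the lower real cost of the 'sender' runs comes entirely from
the one fp and one fn they knock out, while k drops because both sdevs grow.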



Summary: 'sender' was an across-the-board lose for me. It knocked out
a fp and a fn, but did considerable damage to both ham mean and sdev
(0.62 -> 1.32 and 4.27 -> 4.42), and to spam mean and sdev (98.69 -> 98.66
and 7.69 -> 7.86).
'reply-to' tightened up ham scores (h sdev 4.27 -> 4.19) and loosened spam
scores, but not by as much (s sdev 7.69 -> 7.74). I'd suggest re-enabling
reply-to with the following patch:

--- tokenizer.py        31 Oct 2002 15:43:55 -0000      1.59
+++ tokenizer.py        1 Nov 2002 07:51:34 -0000
@@ -1082,10 +1082,9 @@
         # becomes the most powerful indicator in the whole database.
         #
         # From:         # this helps both rates
-        # Reply-To:     # my error rates are too low now to tell about this
-        #               # one (smalls wins & losses across runs, overall
-        #               # not significant), so leaving it out
-        for field in ('from',):
+        # Reply-To:     # this tightens up ham for me (anthony) and makes spam
+        #               # slightly worse (but the ham improvement is more) 
+        for field in ('from', 'reply-to'):
             prefix = field + ':'
             x = msg.get(field, 'none').lower()
             for w in x.split():
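
To see what the patched loop generates, here is a rough standalone sketch.
The function name tokenize_address_headers and the sample message are
hypothetical. What the real loop does with each word is cut off in the diff
above, so this version simply emits the word with the field prefix.

```python
from email import message_from_string


def tokenize_address_headers(msg, fields=("from", "reply-to")):
    """Yield 'field:word' tokens for each listed header (simplified)."""
    for field in fields:
        prefix = field + ":"
        # A missing header produces the single token 'field:none'.
        x = msg.get(field, "none").lower()
        for w in x.split():
            # Stand-in for whatever the real tokenizer does per word.
            yield prefix + w


raw = (
    "From: Some Spammer <offers@example.com>\n"
    "Subject: hello\n"
    "\n"
    "body\n"
)
msg = message_from_string(raw)
tokens = list(tokenize_address_headers(msg))
print(tokens)
```

Note that a message with no Reply-To header still contributes a token
('reply-to:none'), so the absence of the header is itself a clue the
classifier can learn from.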

Does someone else want to repeat this test?

Anthony