[Spambayes] 'sender' and 'reply-to' tokenising.
Anthony Baxter
anthony@interlink.com.au
Fri Nov 1 07:54:47 2002
The comments in tokenizer.py currently say:
# Dang -- I can't use Sender:. If I do,
# 'sender:email name:python-list-admin'
# becomes the most powerful indicator in the whole database.
#
# From:         # this helps both rates
# Reply-To:     # my error rates are too low now to tell about this
#               # one (smalls wins & losses across runs, overall
#               # not significant), so leaving it out
Now that we have things like the ham/spam mean and sdev, we get more useful
data. I tried enabling tokenization of 'sender', of 'reply-to', and of both
together, in each case along with the 'from' line. The left-hand column is
the default.
filename:         from  from+sender  from+sender+replyto  from+replyto
ham:spam:   11192:1826   11192:1826           11192:1826    11192:1826
fp total:            7            6                    7             6
fp %:             0.06         0.05                 0.06          0.05
fn total:            5            4                    5             4
fn %:             0.27         0.22                 0.27          0.22
unsure t:           80           82                   80            81
unsure %:         0.61         0.63                 0.61          0.62
real cost:      $91.00       $80.40               $91.00        $80.20
best cost:      $28.00       $27.20               $28.20        $25.80
h mean:           0.62         1.32                 0.63          1.11
h sdev:           4.27         4.42                 4.19          4.19
s mean:          98.69        98.66                98.68         98.65
s sdev:           7.69         7.86                 7.74          7.92
mean diff:       98.07        97.34                98.05         97.54
k:                8.20         7.93                 8.22          8.05
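
For anyone comparing their own runs: the last two rows can be recomputed from
the four rows above them. The figures in this table are consistent with
mean diff = s mean - h mean and k = mean diff / (h sdev + s sdev). That
relationship is inferred from the numbers, not taken from the test driver's
source, and the function name `separation` is made up for this sketch.

```python
def separation(h_mean, h_sdev, s_mean, s_sdev):
    """Return (mean diff, k) for one column of the table.

    Assumption: k is the gap between the spam and ham means divided by
    the sum of the two standard deviations. This matches the figures in
    the table but is inferred from them, not read from the test driver.
    """
    diff = s_mean - h_mean
    return diff, diff / (h_sdev + s_sdev)


# (h mean, h sdev, s mean, s sdev) for each column, in table order.
columns = [
    (0.62, 4.27, 98.69, 7.69),
    (1.32, 4.42, 98.66, 7.86),
    (0.63, 4.19, 98.68, 7.74),
    (1.11, 4.19, 98.65, 7.92),
]

results = [separation(*col) for col in columns]
for diff, k in results:
    print("mean diff: %6.2f   k: %5.2f" % (diff, k))
```

A larger k means the ham and spam score distributions sit further apart
relative to their spread, which is why a change can remove an fp and still
count as a loss.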
Summary: 'sender' was an across-the-board lose for me. It knocked out
a fp and a fn, but did considerable damage to both ham mean and sdev,
and spam mean and sdev.
'reply-to' tightened up ham scores, and loosened spam scores (but not as
much). I'd suggest re-enabling reply-to with the following patch:
--- tokenizer.py 31 Oct 2002 15:43:55 -0000 1.59
+++ tokenizer.py 1 Nov 2002 07:51:34 -0000
@@ -1082,10 +1082,9 @@
 # becomes the most powerful indicator in the whole database.
 #
 # From:         # this helps both rates
-# Reply-To:     # my error rates are too low now to tell about this
-#               # one (smalls wins & losses across runs, overall
-#               # not significant), so leaving it out
-for field in ('from',):
+# Reply-To:     # this tightens up ham for me (anthony) and makes spam
+#               # slightly worse (but the ham improvement is more)
+for field in ('from', 'reply-to'):
     prefix = field + ':'
     x = msg.get(field, 'none').lower()
     for w in x.split():
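
To make the effect of the patch concrete, here is a rough standalone sketch of
the kind of tokens the patched loop would produce. The patch context stops at
the inner `for`, so what the real tokenizer does with each word is not shown
here; this sketch simply emits `prefix + w`, which is an assumption. The
function name `header_tokens` and the sample message are invented for
illustration.

```python
from email import message_from_string


def header_tokens(msg, fields=('from', 'reply-to')):
    """Yield 'field:word' tokens for each listed header.

    Assumption: each whitespace-separated word is emitted as-is behind
    the field prefix. The real tokenizer may process words further.
    """
    for field in fields:
        prefix = field + ':'
        # A missing header is tokenized as the literal word 'none'.
        x = msg.get(field, 'none').lower()
        for w in x.split():
            yield prefix + w


# Made-up message with a From: header and no Reply-To: header.
raw = (
    "From: Some One <someone@example.com>\n"
    "Subject: hello\n"
    "\n"
    "body\n"
)
msg = message_from_string(raw)
tokens = list(header_tokens(msg))
print(tokens)
```

Note that a message without a Reply-To: header still contributes a token
('reply-to:none'), so the absence of the header is itself a clue.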
Someone else want to repeat this test?
Anthony