[Spambayes] Mining the headers
T. Alexander Popiel
popiel@wolfskeep.com
Sat Oct 26 21:34:09 2002
Tim mentioned three tokenizer options (mine_received_headers,
count_all_header_lines, basic_header_tokenize). I hadn't
played with these yet, so I ran all 8 on/off combinations.
Summary: both mine_received_headers and basic_header_tokenize
seem good for me, but count_all_header_lines is a minor loss.
Key to the run names used below (lowercase = off, uppercase = on):
r == mine_received_headers: False
R == mine_received_headers: True
c == count_all_header_lines: False
C == count_all_header_lines: True
b == basic_header_tokenize: False
B == basic_header_tokenize: True
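
For anyone wanting to reproduce the 8 runs, the run names can be generated mechanically from the three options. This is only a sketch: the helper names are made up for illustration, and the [Tokenizer] section name is an assumption (this message only shows the [Classifier] and [TestDriver] sections).

```python
from itertools import product

# (option name, letter used in the run name), in run-name order.
OPTIONS = [
    ("mine_received_headers", "r"),
    ("count_all_header_lines", "c"),
    ("basic_header_tokenize", "b"),
]


def combinations():
    """Yield (label, settings) for every on/off combination of OPTIONS."""
    for flags in product([False, True], repeat=len(OPTIONS)):
        # Uppercase letter means the option is on, lowercase means off.
        label = "".join(
            letter.upper() if on else letter
            for (_, letter), on in zip(OPTIONS, flags)
        )
        settings = {name: on for (name, _), on in zip(OPTIONS, flags)}
        yield label, settings


def render(settings):
    """Format one combination as an options fragment.

    The "[Tokenizer]" section name is an assumption, not taken from this post.
    """
    lines = ["[Tokenizer]"]
    lines += [f"{name}: {value}" for name, value in settings.items()]
    return "\n".join(lines)


if __name__ == "__main__":
    for label, settings in combinations():
        print(label)
        print(render(settings))
        print()
```

The labels come out in the same order as the columns of the results table (rcb first, RCB last).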
Other options are:
[Classifier]
use_chi_squared_combining: True
[TestDriver]
show_false_negatives: False
show_false_positives: False
show_unsure: False
ham_cutoff: 0.20
spam_cutoff: 0.90
-> <stat> tested 200 hams & 200 spams against 1800 hams & 1800 spams
[...]
filename:        rcb       rcB       rCb       rCB       Rcb       RcB       RCb       RCB
ham:spam:  2000:2000 2000:2000 2000:2000 2000:2000 2000:2000 2000:2000 2000:2000 2000:2000
fp total:          3         3         3         3         3         3         3         3
fp %:           0.15      0.15      0.15      0.15      0.15      0.15      0.15      0.15
fn total:         12        14        16        14        12        12        12        12
fn %:           0.60      0.70      0.80      0.70      0.60      0.60      0.60      0.60
unsure t:         53        37        50        39        40        31        37        32
unsure %:       1.32      0.93      1.25      0.97      1.00      0.78      0.93      0.80
real cost:    $52.60    $51.40    $56.00    $51.80    $50.00    $48.20    $49.40    $48.40
best cost:    $48.20    $45.20    $49.20    $45.60    $37.20    $38.80    $40.60    $38.60
h mean:         0.40      0.32      0.35      0.32      0.31      0.30      0.29      0.29
h sdev:         5.39      4.71      5.12      4.68      4.55      4.47      4.47      4.43
s mean:        98.45     98.68     98.35     98.68     98.75     98.85     98.72     98.85
s sdev:         9.76      9.57     10.46      9.58      9.08      9.06      9.37      9.11
mean diff:     98.05     98.36     98.00     98.36     98.44     98.55     98.43     98.56
k:              6.47      6.89      6.29      6.90      7.22      7.28      7.11      7.28
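
As a cross-check on the table, the "real cost" row is consistent with charging $10 per false positive, $1 per false negative and $0.20 per unsure message, and the "k" row is consistent with mean diff / (h sdev + s sdev). Neither the weights nor the formula is stated in this message; both are inferred from the numbers above, so treat the following Python sketch as an assumption rather than a description of how the test driver computes them.

```python
# Figures copied from the results table, in column order.
LABELS = ["rcb", "rcB", "rCb", "rCB", "Rcb", "RcB", "RCb", "RCB"]
FP = [3] * 8
FN = [12, 14, 16, 14, 12, 12, 12, 12]
UNSURE = [53, 37, 50, 39, 40, 31, 37, 32]
REAL_COST = [52.60, 51.40, 56.00, 51.80, 50.00, 48.20, 49.40, 48.40]
H_SDEV = [5.39, 4.71, 5.12, 4.68, 4.55, 4.47, 4.47, 4.43]
S_SDEV = [9.76, 9.57, 10.46, 9.58, 9.08, 9.06, 9.37, 9.11]
MEAN_DIFF = [98.05, 98.36, 98.00, 98.36, 98.44, 98.55, 98.43, 98.56]
K = [6.47, 6.89, 6.29, 6.90, 7.22, 7.28, 7.11, 7.28]

# Assumed per-message weights in dollars (inferred, not stated in the post).
FP_WEIGHT = 10.00
FN_WEIGHT = 1.00
UNSURE_WEIGHT = 0.20


def real_cost(fp, fn, unsure):
    """Weighted cost of one run under the assumed weights."""
    return fp * FP_WEIGHT + fn * FN_WEIGHT + unsure * UNSURE_WEIGHT


def separation(mean_diff, h_sdev, s_sdev):
    """Assumed form of the k statistic: mean gap over summed deviations."""
    return mean_diff / (h_sdev + s_sdev)


if __name__ == "__main__":
    for i, label in enumerate(LABELS):
        cost = real_cost(FP[i], FN[i], UNSURE[i])
        k = separation(MEAN_DIFF[i], H_SDEV[i], S_SDEV[i])
        print(f"{label}: cost ${cost:.2f} (table ${REAL_COST[i]:.2f}), "
              f"k {k:.2f} (table {K[i]:.2f})")
```

Since the false positive count is 3 in every column, the differences in real cost come entirely from the fn and unsure counts.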
Yes, it looks like there's good info in the headers. Counting
the header lines doesn't appear to be a helpful way to get at
that information, but mining the received headers and just doing
basic tokenization over all the headers both seem to work, and
work even better together.
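
One rough way to see each option's effect in isolation is to average the real cost over the four runs with the option on and the four with it off. This is a sketch with made-up helper names, using only the real cost figures from the table; it says nothing about statistical significance.

```python
# Real cost per run, copied from the results table.
REAL_COST = {
    "rcb": 52.60, "rcB": 51.40, "rCb": 56.00, "rCB": 51.80,
    "Rcb": 50.00, "RcB": 48.20, "RCb": 49.40, "RCB": 48.40,
}

# Position of each option's letter within a run name.
POSITION = {
    "mine_received_headers": 0,
    "count_all_header_lines": 1,
    "basic_header_tokenize": 2,
}


def mean_cost(option, enabled):
    """Average real cost over the runs where `option` is on (or off)."""
    pos = POSITION[option]
    costs = [
        cost for label, cost in REAL_COST.items()
        if label[pos].isupper() == enabled
    ]
    return sum(costs) / len(costs)


def effect(option):
    """Change in average real cost from turning `option` on (negative is better)."""
    return mean_cost(option, True) - mean_cost(option, False)


if __name__ == "__main__":
    for option in POSITION:
        print(f"{option}: {effect(option):+.2f}")
```

On these figures, turning on mine_received_headers lowers the average real cost by about $3.95 and basic_header_tokenize by about $2.05, while count_all_header_lines raises it by about $0.85.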
This is on my website at:
http://www.wolfskeep.com/~popiel/spambayes/headers
- Alex