[Spambayes] Mining the headers

T. Alexander Popiel popiel@wolfskeep.com
Sat Oct 26 21:34:09 2002


Tim mentioned three tokenizer options (mine_received_headers,
count_all_header_lines, basic_header_tokenize).  I hadn't
played with these yet, so I ran all 8 on/off combinations.

Summary: both mine_received_headers and basic_header_tokenize
seem good for me, but count_all_header_lines is a minor lose.

 r == mine_received_headers: False
 R == mine_received_headers: True
 c == count_all_header_lines: False
 C == count_all_header_lines: True
 b == basic_header_tokenize: False
 B == basic_header_tokenize: True
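
The eight run labels follow mechanically from the legend above. Here is a
rough sketch of how they could be enumerated, with each combination rendered
as an ini-style fragment. The helper names (combinations, ini_text) are made
up for illustration, and the [Tokenizer] section name is an assumption: the
post only calls these "tokenizer options" and does not say which section
they live in.

```python
from itertools import product

# Option names as given in the post; the letter order matches the run labels.
OPTIONS = [
    ("r", "mine_received_headers"),
    ("c", "count_all_header_lines"),
    ("b", "basic_header_tokenize"),
]


def combinations():
    """Yield (label, settings) for all 8 on/off combinations.

    A lowercase letter means the option is False, uppercase means True.
    """
    for flags in product([False, True], repeat=len(OPTIONS)):
        label = "".join(
            letter.upper() if on else letter
            for (letter, _), on in zip(OPTIONS, flags)
        )
        settings = {name: on for (_, name), on in zip(OPTIONS, flags)}
        yield label, settings


def ini_text(settings, section="Tokenizer"):
    """Render one combination as an ini-style fragment.

    The default section name is an assumption, not something the post states.
    """
    lines = ["[%s]" % section]
    lines += ["%s: %s" % (name, value) for name, value in settings.items()]
    return "\n".join(lines)


if __name__ == "__main__":
    for label, settings in combinations():
        print(label)
        print(ini_text(settings))
        print()
```

The labels come out in the same order as the columns of the results table
(rcb, rcB, rCb, rCB, Rcb, RcB, RCb, RCB).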

Other options are:
 [Classifier]
 use_chi_squared_combining: True

 [TestDriver]
 show_false_negatives: False
 show_false_positives: False
 show_unsure: False
 ham_cutoff: 0.20
 spam_cutoff: 0.90

-> <stat> tested 200 hams & 200 spams against 1800 hams & 1800 spams
[...]
filename:      rcb     rcB     rCb     rCB     Rcb     RcB     RCb     RCB
ham:spam:  2000:2000 in all eight columns
fp total:        3       3       3       3       3       3       3       3
fp %:         0.15    0.15    0.15    0.15    0.15    0.15    0.15    0.15
fn total:       12      14      16      14      12      12      12      12
fn %:         0.60    0.70    0.80    0.70    0.60    0.60    0.60    0.60
unsure t:       53      37      50      39      40      31      37      32
unsure %:     1.32    0.93    1.25    0.97    1.00    0.78    0.93    0.80
real cost:  $52.60  $51.40  $56.00  $51.80  $50.00  $48.20  $49.40  $48.40
best cost:  $48.20  $45.20  $49.20  $45.60  $37.20  $38.80  $40.60  $38.60
h mean:       0.40    0.32    0.35    0.32    0.31    0.30    0.29    0.29
h sdev:       5.39    4.71    5.12    4.68    4.55    4.47    4.47    4.43
s mean:      98.45   98.68   98.35   98.68   98.75   98.85   98.72   98.85
s sdev:       9.76    9.57   10.46    9.58    9.08    9.06    9.37    9.11
mean diff:   98.05   98.36   98.00   98.36   98.44   98.55   98.43   98.56
k:            6.47    6.89    6.29    6.90    7.22    7.28    7.11    7.28
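
As a sanity check on the "real cost" row, here is a small sketch that
recomputes it from the fp, fn and unsure counts in the table. The weights
($10 per false positive, $1 per false negative, $0.20 per unsure) are an
assumption: the post does not state them, but they are consistent with all
eight columns above.

```python
# Counts copied from the table above, in column order.
LABELS = ["rcb", "rcB", "rCb", "rCB", "Rcb", "RcB", "RCb", "RCB"]
FP = [3, 3, 3, 3, 3, 3, 3, 3]
FN = [12, 14, 16, 14, 12, 12, 12, 12]
UNSURE = [53, 37, 50, 39, 40, 31, 37, 32]

# Assumed cost weights in dollars; not stated in the post.
FP_WEIGHT = 10.0
FN_WEIGHT = 1.0
UNSURE_WEIGHT = 0.2


def real_cost(fp, fn, unsure):
    """Weighted sum of the three kinds of mistakes, in dollars."""
    return fp * FP_WEIGHT + fn * FN_WEIGHT + unsure * UNSURE_WEIGHT


COSTS = {
    label: round(real_cost(fp, fn, unsure), 2)
    for label, fp, fn, unsure in zip(LABELS, FP, FN, UNSURE)
}

# Label of the run with the lowest real cost.
BEST = min(COSTS, key=COSTS.get)

if __name__ == "__main__":
    for label in LABELS:
        print("%-4s $%.2f" % (label, COSTS[label]))
    print("lowest real cost:", BEST)
```

Under those weights the computed costs match the "real cost" row, and RcB
comes out lowest at $48.20.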

Yes, it looks like there's good info in the headers.  Counting
the header lines doesn't appear to be a helpful way to get at
that information (real cost goes up in three of the four c -> C
pairings), but mining the received headers and just doing
basic tokenization over all the headers both seem to work, and
work even better together (RcB has the lowest real cost, $48.20).

This is on my website at:
  http://www.wolfskeep.com/~popiel/spambayes/headers

- Alex