[Spambayes] Something to test

Tim Peters tim.one@comcast.net
Tue Nov 5 00:48:40 2002


[Tim]
> This little patch arranges to create "noheader:HEADERNAME" tokens for
> headers in options.safe_headers that *don't* appear in a msg's headers.

This has been checked in now, disabled by default, under bool option name
record_header_absence.
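
For anyone who wants to see the idea without reading the patch, here is a minimal sketch of absence tokens. It is not the checked-in code: SAFE_HEADERS is a made-up stand-in for options.safe_headers, and header_absence_tokens is a hypothetical helper name.

```python
import email

# Hypothetical stand-in for options.safe_headers; the real list lives in
# the Spambayes options and may well differ.
SAFE_HEADERS = ["subject", "to", "mime-version", "errors-to"]

def header_absence_tokens(msg, safe_headers=SAFE_HEADERS):
    """Yield 'noheader:NAME' for each safe header missing from msg."""
    # Header names are case-insensitive, so compare in lowercase.
    present = {name.lower() for name in msg.keys()}
    for name in safe_headers:
        if name.lower() not in present:
            yield "noheader:" + name.lower()

msg = email.message_from_string("From: a@example.com\nSubject: hi\n\nbody\n")
tokens = list(header_absence_tokens(msg))
print(tokens)
```

The sample msg has From and Subject only, so it yields noheader: tokens for to, mime-version and errors-to.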

[Anthony Baxter]

Thanks for testing!

> filename:    before  after
> ham:spam:  11192:1826
>                    11192:1826
> fp total:        0       1
> fp %:         0.00    0.01
> fn total:        7       8
> fn %:         0.38    0.44
> unsure t:      106     107
> unsure %:     0.81    0.82
> real cost:  $28.20  $39.40
> best cost:  $28.20  $30.40
> h mean:       0.63    0.42
> h sdev:       4.19    4.19
> s mean:      98.68   98.63
> s sdev:       7.74    7.95
> mean diff:   98.05   98.21
> k:            8.22    8.09

Wow -- it cut your ham mean by a third <wink>.

> The additional fp was a mail-out from Nettwerk (that I've signed up
> for, but which are _incredibly_ spammy) that went from 0.956 to 0.964,
> where my spam cutoff is 0.96. The noheader: errors-to was the killer
> clue that pushed it over the edge. The spam situation is considerably
> worse. The additional false negative was something that went from 0.467
> to 0.431 (ham_cutoff 0.45). The damage came from
>   prob('noheader:mime-version') = 0.245329
> (It was a very short spam)

So, in all, it nudged two marginal msgs over the edge, but in the wrong
directions.  So I disabled it by default.  It helps python.org tests,
though, so it's an option now.

> One fn went from 0.27 to 0.029, due to:
>   prob('noheader:subject') = 0.0042591
>   prob('noheader:to') = 0.0652536

Those are bizarre.  From where do you get ham lacking Subject and To
headers?  In my personal classifier,

                   #h  #s  spamprob
'noheader:to'      10  95  0.884678455795
'noheader:subject'  2  16  0.858858950186

Is there some systematic reason why you've got lots of ham without key
header lines?  Your noheader:subject spamprob in particular is astonishingly
low.

>   prob('noheader:mime-version') = 0.245329
>
> It made pretty much all of my fn's at least slightly worse, if not
> much worse.

The lack of common headers in your ham is the mystery to me.  Try to figure
out why that is?  For example, perhaps you have some systematic source of
ham creating headers the email pkg can't parse.  In that case we fall back
to the raw body text, and don't get any header info at all.  But in that
case, we should learn *why* the email pkg is blowing up, and worm around it.

For the same reason your FN got worse, your FN would get better if these
things had the high spamprobs they were expected to have (and do have, in
all my tests; nobody else has reported on this experiment, alas).