[Spambayes] Email client integration -- what's needed?

Sun Nov 3 18:28:18 2002

[Richie Hindle]
> ...
> You're right - losing headers will make a difference, even with the fairly
> minimal header tokenising we currently do.  When I added the Unsure
> classification to pop3proxy, I tested it by forwarding a bunch of spams to
> myself and they all came out Unsure where they had been Yes before - at
> first I thought it was a bug, but then a couple of genuine spams rolled in
> and were classified correctly.

There's indeed a *lot* of info in the headers we look at by default.  About
a full day of work went into deciding on each one of those, and finding the
most helpful way to tokenize each.  Alas, most of that work went into
discovering which headers didn't improve results, or gave great results for
bogus reasons.  OTOH, at the start we didn't look at headers *at all* in
this project (it took a long time to sort out the problems with headers in
mixed-source corpora), so we worked harder than other projects at tokenizing
the body in effective ways too.

Here's the tokenization generator:

    def tokenize(self, obj):
        msg = self.get_message(obj)

        for tok in self.tokenize_headers(msg):
            yield tok
        for tok in self.tokenize_body(msg):
            yield tok

If we comment out the second loop, the classifier sees only the headers;
if we comment out the first, it sees only the bodies.
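
For anyone who wants to repeat the experiment without editing the source
each time, the same idea can be sketched with two switches.  This is only a
toy, not the project's tokenizer: the class name ToyTokenizer, the
use_headers / use_body flags, and the crude word splitting are all
assumptions invented for the illustration.

```python
import email


class ToyTokenizer:
    """Hypothetical stand-in for the real tokenizer; all names are illustrative."""

    def __init__(self, use_headers=True, use_body=True):
        self.use_headers = use_headers
        self.use_body = use_body

    def get_message(self, obj):
        # Accept either raw message text or an already parsed message.
        if isinstance(obj, str):
            return email.message_from_string(obj)
        return obj

    def tokenize_headers(self, msg):
        # Crude placeholder: one token per word of the Subject line only.
        for word in (msg.get("Subject") or "").split():
            yield "subject:" + word.lower()

    def tokenize_body(self, msg):
        # Crude placeholder: lowercased words of a non-multipart body.
        payload = msg.get_payload()
        if isinstance(payload, str):
            for word in payload.split():
                yield word.lower()

    def tokenize(self, obj):
        msg = self.get_message(obj)
        if self.use_headers:
            yield from self.tokenize_headers(msg)
        if self.use_body:
            yield from self.tokenize_body(msg)


raw = "Subject: Great Offer\n\nBuy now\n"
hdr_only = list(ToyTokenizer(use_body=False).tokenize(raw))
body_only = list(ToyTokenizer(use_headers=False).tokenize(raw))
```

With both flags left at their defaults this behaves like the generator
above, yielding header tokens first and body tokens second.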

Here are results from doing that, on the same randomized set of 2000 ham +
2000 spam from my c.l.py test, with ham_cutoff=0.2 and spam_cutoff=0.8, and
also using the "generate tokens for the absence of key header lines too"
patch I posted in the wee hours.  "before" is looking at both hdrs and body,
"hdr" looking only at headers (no bodies), and "body" looking only at bodies
(no headers):

filename:   before     hdr    body
ham:spam:  2000:2000       2000:2000
                   2000:2000
fp total:        1       0       5
fp %:         0.05    0.00    0.25
fn total:        0       0       1
fn %:         0.00    0.00    0.05
unsure t:       20      29      62
unsure %:     0.50    0.72    1.55
real cost:  $14.00   $5.80  $63.40
best cost:   $2.00   $1.60  $10.40
h mean:       0.55    0.66    1.68
h sdev:       4.50    3.46    8.02
s mean:      99.91   99.40   99.56
s sdev:       1.64    3.46    4.46
mean diff:   99.36   98.74   97.88
k:           16.18   14.27    7.84

A higher spam_cutoff would have helped the body column a lot, but it's clear
we're getting an enormous amount of useful info out of the handful of header
lines we look at by default; indeed, the hdr column is marginally better
than the before column!
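
For reading the "real cost" and "unsure %" rows: the numbers in the table
are consistent with charging $10 per FP, $1 per FN and $0.20 per unsure, and
with "unsure %" being taken over all ham + spam scored.  Those weights are
inferred from the table, not stated in this message, so treat the sketch
below as an assumption that happens to reproduce the rows.

```python
# Assumed weights in cents, inferred from the table rows (not stated in the post).
FP_CENTS = 1000
FN_CENTS = 100
UNSURE_CENTS = 20


def real_cost(fp, fn, unsure):
    """Total cost in dollars under the assumed weights."""
    cents = fp * FP_CENTS + fn * FN_CENTS + unsure * UNSURE_CENTS
    return cents / 100


def unsure_percent(unsure, n_ham, n_spam):
    """Unsure messages as a percentage of everything scored."""
    return 100.0 * unsure / (n_ham + n_spam)


# (fp, fn, unsure) taken from the before, hdr and body columns above.
for name, counts in [("before", (1, 0, 20)), ("hdr", (0, 0, 29)), ("body", (5, 1, 62))]:
    print(name, "$%.2f" % real_cost(*counts),
          "%.2f%%" % unsure_percent(counts[2], 2000, 2000))
```

Under these weights a single FP costs as much as 50 unsures, which is why
the body column's 5 FP dominate its total.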

In the body column, the FN was one of those brief "Paul, it was great to see
you today.  The proposal will be ready tomorrow.  Heidi." spams.  The only
real spam clues in those are in the headers.  The FPs are harder to
characterize:  a mix of conference announcements, one-liner "unsubscribe"
thingies, and thoroughly off-topic posts.  By default they get redeemed
because the headers contain clues that they came from a real person, and
weren't posted using spammer software that leaves behind strange
capitalization (BTW, "MiME-Version:", with the lowercase i, turned out to be
one of the highest-spamprob words in my personal email classifier too -- it
wasn't unique to BruceG's spam).
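
To show how a capitalization clue like that can survive into a token at
all, here is a minimal sketch using the standard email package, which keeps
header field names in the case they arrived in.  The function name and the
"header:" prefix are made up for the illustration; this is not a claim
about how the project's tokenize_headers() does it.

```python
import email


def header_name_tokens(raw_text):
    """Yield one token per header line, preserving the sender's capitalization.

    Hypothetical helper: the "header:" prefix is an illustrative choice.
    """
    msg = email.message_from_string(raw_text)
    for name in msg.keys():
        # keys() returns field names as they appeared in the message,
        # so "MiME-Version" and "MIME-Version" become different tokens.
        yield "header:" + name


odd = "MiME-Version: 1.0\nSubject: hello\n\nbody text\n"
normal = "MIME-Version: 1.0\nSubject: hello\n\nbody text\n"
odd_tokens = list(header_name_tokens(odd))
normal_tokens = list(header_name_tokens(normal))
```

Because the two spellings yield distinct tokens, a classifier can learn a
spamprob for the odd one separately from the normal one.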

Using twice as much test data makes a mildly interesting point:

filename:   before     hdr    body
ham:spam:  4000:4000       4000:4000
                   4000:4000
fp total:        1       0       4
fp %:         0.03    0.00    0.10
fn total:        0       0       1
fn %:         0.00    0.00    0.03
unsure t:       28      71     114
unsure %:     0.35    0.89    1.43
real cost:  $15.60  $14.20  $63.80
best cost:   $2.40   $3.80  $20.00
h mean:       0.36    0.63    1.44
h sdev:       3.28    3.68    6.89
s mean:      99.93   99.44   99.64
s sdev:       1.42    3.40    4.07
mean diff:   99.57   98.81   98.20
k:           21.19   13.96    8.96

The h and s means & sdevs in the hdr column barely budge, but in the body
column obviously "improve".  That suggests there's more variability in the
bodies (than in the headers) of both ham and spam.
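
The "k" row summarizes the same thing in one number.  The values in both
tables are consistent with k = mean diff / (h sdev + s sdev), so a wider
spread in either population drags k down.  That formula is inferred from the
rows above rather than stated in this message; the sketch below just checks
it against the rounded figures.

```python
def separation_k(h_mean, h_sdev, s_mean, s_sdev):
    """Assumed definition: gap between the means over the sum of the sdevs."""
    return (s_mean - h_mean) / (h_sdev + s_sdev)


# (h mean, h sdev, s mean, s sdev) copied from the 4000:4000 table.
columns = {
    "before": (0.36, 3.28, 99.93, 1.42),
    "hdr": (0.63, 3.68, 99.44, 3.40),
    "body": (1.44, 6.89, 99.64, 4.07),
}
for name, stats in columns.items():
    print(name, round(separation_k(*stats), 2))
```

The body column has nearly the same mean diff as the others but roughly
double the combined sdev, which accounts for most of its lower k.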

Bottom line:  the header info is vital in this scheme for best results, but
you could get a useful classifier out of headers alone or bodies alone!