spambayes status - Python-Dev

2 Sep 2002

I spent an enormous amount of time this weekend running tests against
various changes -- a "1% inspiration, 99% perspiration" kind of thing.
There are lots of words about the changes (both good and bad) in the comment
blocks and checkin msgs.  The biggest "conceptual" change is that I'm now
using (but only using) the Subject and From lines from the headers (my
earlier belief that the ham corpora Subject lines were too corrupted by
Mailman decorations turned out to be wrong).  Adding Subject lines gave a
remarkably small improvement, btw.  Most changes I tried either didn't
matter, or hurt.  Approximately 70 more blatant spams in the ham corpora
were identified and replaced with (randomly selected) legitimate msgs.

The f-p rate is too low now to measure changes with confidence.  Best guess
I can make from the evidence is that it's below 0.05% now.  The false
negative rate has improved more, and there's still plenty of those (so it's
still easy to be confident about whether changes do or don't help that).

Across all 20 runs (each training on 4000 ham + about 2750 spam, then
predicting against a different set with the same number of each), these are
the false positive and negative rates now (percentages; note that 0.025% is
a single message in the f-p column; a single msg in the f-n column is about
0.036%):

      f-p     f-n
    0.000   1.236
    0.000   1.164
    0.050   1.454
    0.000   1.599
    0.025   1.527
    0.025   1.236
    0.050   1.163
    0.025   1.309
    0.025   1.891
    0.000   1.418
    0.075   1.745
    0.050   1.708
    0.025   1.491
    0.000   0.836
    0.050   1.091
    0.025   1.309
    0.025   1.491
    0.000   1.127
    0.025   1.309
    0.050   1.636

The aggregate number of unique f-p across all runs is down to 8.
The aggregate number of unique f-n across all runs is 336.
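
For anyone who wants to check the arithmetic behind those percentages, here is a minimal sketch.  It is not part of the spambayes code; the names are made up for illustration, and 2750 is only the approximate per-run spam count quoted above, so the f-n step size is approximate too.

```python
# Illustrative sketch only: corpus sizes come from the text above
# (4000 ham per run, "about 2750" spam per run).
HAM_PER_RUN = 4000
SPAM_PER_RUN = 2750  # approximate, per the text

# Rates (in percent) copied from the 20-run table.
fp_rates = [0.000, 0.000, 0.050, 0.000, 0.025, 0.025, 0.050, 0.025,
            0.025, 0.000, 0.075, 0.050, 0.025, 0.000, 0.050, 0.025,
            0.025, 0.000, 0.025, 0.050]
fn_rates = [1.236, 1.164, 1.454, 1.599, 1.527, 1.236, 1.163, 1.309,
            1.891, 1.418, 1.745, 1.708, 1.491, 0.836, 1.091, 1.309,
            1.491, 1.127, 1.309, 1.636]


def pct_per_message(corpus_size):
    """Percentage points contributed by one misclassified message."""
    return 100.0 / corpus_size


fp_step = pct_per_message(HAM_PER_RUN)    # one f-p, as a percentage
fn_step = pct_per_message(SPAM_PER_RUN)   # one f-n, roughly

# Convert each run's f-p rate back into a whole number of messages.
fp_counts = [round(rate / fp_step) for rate in fp_rates]
total_fp = sum(fp_counts)

mean_fp = sum(fp_rates) / len(fp_rates)
mean_fn = sum(fn_rates) / len(fn_rates)

print("one f-p = %.3f%%, one f-n = about %.3f%%" % (fp_step, fn_step))
print("f-p messages summed over all runs:", total_fp)
print("mean f-p = %.5f%%, mean f-n = %.3f%%" % (mean_fp, mean_fn))
```

Note that total_fp adds up misclassifications run by run, so a ham message that gets flagged in more than one run is counted each time; that is why it can come out larger than the number of unique f-p.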

The 8 ham messages for which at least one run claimed it was spam are
attached.  Note that I finally removed the "If AOL were a car" spam from the
good corpus; while it may or may not be amusing, it *was* automated bulk
email, even to the extent of including large blocks of random characters at
the end.  The message consisting almost entirely of quoting a Nigerian scam
message looks like it would be a "false positive" under any scheme worth
using, but I left it in the good corpus (so it's still an f-p here), because
it wasn't bulk email (the original msg was, but the reply was not).

Tim Peters