Third result ... RE: [Spambayes] First result from Gary Robinson's ideas

Thu, 19 Sep 2002 03:17:25 -0400

I've checked in sane changes to the code base now, so that you can try Gary
Robinson's increasingly remarkable probability combining scheme (it was
amazing when I first tried it, and triply so when Neale reported isomorphic
results on a wholly different corpus).  Just set

    [Classifier]
    use_robinson_probability: True

    [TestDriver]
    spam_cutoff: 0.50

as a pair.  The best possible value of spam_cutoff is easy to deduce from
the final score histograms after a run, but for my and Neale's runs 0.50 was
indeed optimal.
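
As a rough illustration of reading the best cutoff off the scores, here is a
minimal sketch.  It assumes you have the per-message scores in two lists and
that "best" means the fewest total errors (fp + fn); both function names are
made up for this example and are not part of the code base.  You may well
prefer to weight false positives more heavily.

```python
def count_errors(ham_scores, spam_scores, cutoff):
    """Return (false positives, false negatives) at the given cutoff."""
    # A message is rated spam only if its score is strictly greater
    # than the cutoff, so a score equal to the cutoff counts as ham.
    fp = sum(1 for s in ham_scores if s > cutoff)
    fn = sum(1 for s in spam_scores if not s > cutoff)
    return fp, fn


def best_cutoff(ham_scores, spam_scores, candidates):
    """Pick the candidate with the fewest total errors.

    Assumption for this sketch: fp and fn are weighted equally, and
    ties go to whichever candidate comes first in `candidates`.
    """
    return min(candidates,
               key=lambda c: sum(count_errors(ham_scores, spam_scores, c)))
```

Calling best_cutoff(ham_scores, spam_scores, [i / 40 for i in range(40)])
would try the same 0.025-wide steps the histograms use.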

I just ran one small-subset test on this before checking it in, on a random
subset of 1000 each of my ham and spam sets.  Results were identical to those
under Graham's scheme across all 10 runs (no change in fp or fn rates, no
change in total unique fp or fn).  The score histograms under Graham's
combining scheme are "the usual" bipolar extreme:

Ham distribution for all runs:
* = 17 items
  0.00 998 ***********************************************************
  2.50   0
  5.00   0
  7.50   0
 10.00   0
 12.50   0
 15.00   0
 17.50   0
 20.00   0
 22.50   0
 25.00   0
 27.50   0
 30.00   0
 32.50   0
 35.00   0
 37.50   0
 40.00   0
 42.50   0
 45.00   0
 47.50   0
 50.00   0
 52.50   0
 55.00   0
 57.50   0
 60.00   0
 62.50   0
 65.00   0
 67.50   0
 70.00   0
 72.50   0
 75.00   0
 77.50   0
 80.00   0
 82.50   0
 85.00   0
 87.50   0
 90.00   0
 92.50   0
 95.00   0
 97.50   2 *

Spam distribution for all runs:
* = 17 items
  0.00   2 *
  2.50   0
  5.00   0
  7.50   0
 10.00   0
 12.50   0
 15.00   0
 17.50   0
 20.00   0
 22.50   0
 25.00   0
 27.50   0
 30.00   0
 32.50   0
 35.00   0
 37.50   1 *
 40.00   0
 42.50   0
 45.00   0
 47.50   0
 50.00   1 *
 52.50   0
 55.00   0
 57.50   0
 60.00   0
 62.50   0
 65.00   0
 67.50   0
 70.00   0
 72.50   0
 75.00   0
 77.50   0
 80.00   0
 82.50   0
 85.00   0
 87.50   0
 90.00   0
 92.50   0
 95.00   1 *
 97.50 995 ***********************************************************

Under Gary's scheme, again more interesting:

Ham distribution for all runs:
* = 14 items
  0.00 816 ***********************************************************
  2.50  37 ***
  5.00  13 *
  7.50   2 *
 10.00  12 *
 12.50  34 ***
 15.00  16 **
 17.50  13 *
 20.00   9 *
 22.50  13 *
 25.00  10 *
 27.50   7 *
 30.00   8 *
 32.50   3 *
 35.00   2 *
 37.50   1 *
 40.00   1 *
 42.50   1 *
 45.00   0
 47.50   0

Note that the two false positives live in this range:

 50.00   1 *
 52.50   0
 55.00   0
 57.50   1 *

 60.00   0
 62.50   0
 65.00   0
 67.50   0
 70.00   0
 72.50   0
 75.00   0
 77.50   0
 80.00   0
 82.50   0
 85.00   0
 87.50   0
 90.00   0
 92.50   0
 95.00   0
 97.50   0

Spam distribution for all runs:
* = 15 items
  0.00   0
  2.50   0
  5.00   0
  7.50   0
 10.00   0
 12.50   0
 15.00   0
 17.50   0
 20.00   0
 22.50   0
 25.00   0
 27.50   0
 30.00   0
 32.50   0
 35.00   0

Three of the four false negatives live here:

 37.50   1 *
 40.00   1 *
 42.50   1 *

 45.00   0
 47.50   0

The fourth false negative scored exactly 0.50, and lives here:

 50.00   2 *

 52.50   4 *
 55.00   1 *
 57.50   0
 60.00   4 *
 62.50   1 *
 65.00   8 *
 67.50   6 *
 70.00   3 *
 72.50   5 *
 75.00   8 *
 77.50   9 *
 80.00   9 *
 82.50  16 **
 85.00   9 *
 87.50   0
 90.00   1 *
 92.50   4 *
 95.00  14 *
 97.50 893 ************************************************************
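
A rough sketch of why the two schemes spread scores so differently.  It
assumes the combining rules as described in Graham's and Robinson's published
write-ups; the function names are illustrative, and this is not the checked-in
classifier code (which also has to cope with probabilities of exactly 0 or 1,
ignored here).

```python
import math


def graham_combine(probs):
    """Graham-style combining: prod(p) / (prod(p) + prod(1 - p))."""
    spam = math.prod(probs)
    ham = math.prod(1.0 - p for p in probs)
    return spam / (spam + ham)


def robinson_combine(probs):
    """Robinson-style combining, as understood from his description.

    P = 1 - (prod(1 - p)) ** (1/n)
    Q = 1 - (prod(p)) ** (1/n)
    S = (P - Q) / (P + Q), rescaled from [-1, 1] to [0, 1].
    Assumes every p is strictly between 0 and 1.
    """
    n = len(probs)
    P = 1.0 - math.prod(1.0 - p for p in probs) ** (1.0 / n)
    Q = 1.0 - math.prod(probs) ** (1.0 / n)
    S = (P - Q) / (P + Q)
    return (1.0 + S) / 2.0


# Three spammy clues and one hammy clue (made-up numbers).
clues = [0.9, 0.9, 0.9, 0.2]
print(graham_combine(clues), robinson_combine(clues))
```

With those made-up clues the Graham-style score lands above 0.99 while the
Robinson-style score stays in the middle of the range.  The geometric means in
the latter keep a handful of strong clues from saturating the result, which is
consistent with the much wider spread in the histograms above.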

So how did a spam score exactly 0.50?  Like this:

"""
Data/Spam/Set9/1673.txt
prob = 0.5
prob('control: MessageParseError') = 0.5

Return-Path: <qvtlv@tjohoo.se>
Delivered-To: em-ca-bait@em.ca
Received: (qmail 10178 invoked from network); 5 Jun 2002 10:19:14 -0000
Received: from serverw3.easyw3.fr (HELO easyw4.easyw3.fr) (194.3.175.1)
  by churchill.factcomp.com with SMTP; 5 Jun 2002 10:19:14 -0000
Received: from mail.tjohoo.se (216.50.157.50 [216.50.157.50]) by
easyw4.easyw3.fr with SMTP (Microsoft Exchange Internet Mail Service Version
5.5.2655.55)
        id MAQYC3S9; Wed, 5 Jun 2002 07:49:12 +0200
From: "Marietta" <qvtlv@tjohoo.se>
To: "kpe@cs.com
" <kpe@cs.com
>
Subject: look good and have a great summer
Content-Type: text/plain; charset="us-ascii";format=flowed
Content-Transfer-Encoding: 7bit
Content-Length: 1058
Lines: 34

... and the rest is irrelevant ...
"""

That is, the email parser barfed on the malformed "To" header, so
"MessageParseError" was the *only* token generated, and that didn't appear
in the training set at all (so it got our "unknown word prob").  Note that
something is rated spam only if its score is strictly greater than
spam_cutoff, and 0.5 isn't greater than 0.5 <wink>.  I (or someone else --
please <wink>?) should probably change the tokenizer to back off to the raw
message body when the email parser gives up.
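
A minimal sketch of that kind of fallback, to make the idea concrete.  The
tokenize() function here, its `parse` parameter, and the whitespace splitting
are all hypothetical stand-ins for the real tokenizer; only
email.message_from_string and email.errors.MessageParseError are real library
names.  Keeping the control token alongside the body tokens is a choice made
for this sketch, not something settled above.

```python
import email
import email.errors


def tokenize(raw_text, parse=email.message_from_string):
    """Yield tokens, falling back to the raw body if parsing fails."""
    try:
        msg = parse(raw_text)
        payload = msg.get_payload()
        # Multipart payloads are lists; this sketch punts on those and
        # just uses the raw text instead of walking the parts.
        body = payload if isinstance(payload, str) else raw_text
    except email.errors.MessageParseError:
        # Hypothetical fallback: treat everything after the first blank
        # line as the body, and still record that the parser gave up.
        _headers, _sep, body = raw_text.partition("\n\n")
        yield "control: MessageParseError"
    # Crude whitespace tokenization, standing in for the real thing.
    for word in body.split():
        yield word
```

With something along these lines, a message like the one above would yield the
words of its body in addition to the lone control token, so it would no longer
be stuck at the unknown-word probability of 0.5.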

This was a long spam, but I guarantee we would have nailed it just from the
first paragraph:

"""
As seen on NBC, CBS, CNN, and even Oprah! The health
discovery that actually reverses aging while burning fat,
without dieting or exercise! This proven discovery has even
been reported on by the New England Journal of Medicine.
Forget  aging and dieting forever! And it's Guaranteed!

Click here:
http://66.231.133.205/hgh/index.html
"""