[Spambayes] Two Scheme Enter, One Scheme Leave.

Anthony Baxter anthony@interlink.com.au
Wed, 25 Sep 2002 18:09:04 +1000


This is on my mungo-corpus (which, after its most recent update, is
now 11300 spam and 20200 ham), but selecting only 2000 spam and 2000
ham. I chose a seed of 12346, just to do one better than Tim :)

Robinson defaults as supplied by Tim

  (Graham on left)

    false positive percentages
        0.500  0.500  tied
        2.500  3.500  lost   +40.00%
        1.500  2.500  lost   +66.67%
        2.500  3.000  lost   +20.00%
        2.500  3.500  lost   +40.00%
        1.000  0.500  won    -50.00%
        0.500  2.500  lost  +400.00%
        1.000  2.500  lost  +150.00%
        1.500  2.500  lost   +66.67%
        2.500  2.500  tied

    won   1 times
    tied  2 times
    lost  7 times

    total unique fp went from 32 to 47 lost   +46.88%
    mean fp % went from 1.6 to 2.35 lost   +46.88%

    false negative percentages
        0.000  0.000  tied
        1.500  0.500  won    -66.67%
        0.500  0.000  won   -100.00%
        0.500  0.000  won   -100.00%
        1.000  1.000  tied
        0.500  0.500  tied
        0.500  0.500  tied
        1.000  0.000  won   -100.00%
        0.000  0.000  tied
        0.000  0.500  lost  +(was 0)

    won   4 times
    tied  5 times
    lost  1 times

    total unique fn went from 11 to 6 won    -45.45%
    mean fn % went from 0.55 to 0.3 won    -45.45%
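
For anyone reading the comparison output for the first time: the
percentage in the last column looks like the relative change from the
left-hand value to the right-hand one. A minimal sketch of that
arithmetic follows; the function name is made up for illustration and is
not taken from the comparison script.

```python
def relative_change(before, after):
    """Percent change going from `before` to `after`.

    Returns None when `before` is 0, since the change is undefined there
    (the output above prints "+(was 0)" for that case).
    """
    if before == 0:
        return None
    return (after - before) / before * 100.0


# Totals from the Graham vs. default Robinson run above.
print(f"total unique fp: {relative_change(32, 47):+.2f}%")
print(f"total unique fn: {relative_change(11, 6):+.2f}%")
```

A positive number means the right-hand scheme made more errors than the
left-hand one ("lost"), a negative number means fewer ("won").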

Raising spam_cutoff to 0.6 (the optimal value for minimum fn+fp) gives us
  (Graham on left)

    total unique fp went from 32 to 11 won    -65.62%
    mean fp % went from 1.6 to 0.55 won    -65.62%
    total unique fn went from 11 to 21 lost   +90.91%
    mean fn % went from 0.55 to 1.05 lost   +90.91%

  (default Robinson (spam_cutoff 0.550) on left)

    total unique fp went from 47 to 11 won    -76.60%
    mean fp % went from 2.35 to 0.55 won    -76.60%
    total unique fn went from 6 to 21 lost  +250.00%
    mean fn % went from 0.3 to 1.05 lost  +250.00%

So let's leave spam_cutoff at 0.6 from now on (rather than trying to
juggle 15 different parameters at once).
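
To make the fp/fn trade-off concrete, here is a small sketch of how
moving a cutoff shifts errors from one side to the other. It assumes a
message is called spam when its score is strictly above the cutoff; the
function name and the scores are invented for illustration and are not
from the mungo corpus or the classifier code.

```python
def count_errors(ham_scores, spam_scores, cutoff):
    """Return (fp, fn) when a score above `cutoff` is called spam."""
    fp = sum(1 for s in ham_scores if s > cutoff)     # ham flagged as spam
    fn = sum(1 for s in spam_scores if s <= cutoff)   # spam let through
    return fp, fn


# Made-up scores, purely illustrative.
ham = [0.10, 0.42, 0.57, 0.58]
spam = [0.56, 0.63, 0.90, 0.99]

for cutoff in (0.55, 0.60):
    fp, fn = count_errors(ham, spam, cutoff)
    print(f"cutoff={cutoff:.2f}  fp={fp}  fn={fn}  fp+fn={fp + fn}")
```

Raising the cutoff can only remove false positives and add false
negatives, which is the pattern in the numbers above.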

Summarising values tried for robinson_probability_a
(using the 0.6 cutoff)

     a      fp       fn    fp+fn
    0.0     18      650     668
    0.001   13       36      49
    0.01    13       28      41
    0.025   12       24      36
    0.05    11       23      34
    0.075   10       21      31
    0.1      9       21      30
    0.125   10       21      31
    0.15    10       21      31
    0.2      9       22      31
    0.25     9       22      31
    0.35    10       21      31
    0.45    10       22      32
    0.5     11       21      32     (Tim's default)
    1.0     13       29      42
    2.0     12       42      54
    10.0    11       96     107

I have the raw run data for these, if anyone cares. They're rather large :)
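
The summary table can be checked mechanically. A quick sketch, with the
(a, fp, fn) rows copied from the table above (0.6 cutoff) and the variable
names chosen just for this example:

```python
# (robinson_probability_a, fp, fn) rows from the table, spam_cutoff = 0.6
rows = [
    (0.0, 18, 650), (0.001, 13, 36), (0.01, 13, 28), (0.025, 12, 24),
    (0.05, 11, 23), (0.075, 10, 21), (0.1, 9, 21), (0.125, 10, 21),
    (0.15, 10, 21), (0.2, 9, 22), (0.25, 9, 22), (0.35, 10, 21),
    (0.45, 10, 22), (0.5, 11, 21), (1.0, 13, 29), (2.0, 12, 42),
    (10.0, 11, 96),
]

# Smallest combined error count, and every value of a that reaches it.
best_total = min(fp + fn for _, fp, fn in rows)
winners = [a for a, fp, fn in rows if fp + fn == best_total]

print("minimum fp+fn:", best_total)
print("achieved at a =", winners)
```

Note this treats an fp and an fn as equally costly; weighting false
positives more heavily would be a different (and arguably more
realistic) criterion.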

So it looks like a=0.1, cutoff=0.6 is the winning combo of these two:
  (Graham on left)
    total unique fp went from 32 to 9 won    -71.88%
    mean fp % went from 1.6 to 0.45 won    -71.88%
    total unique fn went from 11 to 21 lost   +90.91%
    mean fn % went from 0.55 to 1.05 lost   +90.91%

  (default Robinson on left)
    total unique fp went from 47 to 9 won    -80.85%
    mean fp % went from 2.35 to 0.45 won    -80.85%
    total unique fn went from 6 to 21 lost  +250.00%
    mean fn % went from 0.3 to 1.05 lost  +250.00%

Next up, the other knobs and dials!

Anthony