[Spambayes] spamprob combining

Fri, 11 Oct 2002 21:34:50 -0400

Regardless of whether the chi-squared code makes sense, I whipped up another
spamprob() variant to use it, and checked it in.  There's a new option:

[Classifier]
use_chi_squared_combining: False

This is yet another alternative to use_tim_combining (by the way, offline
Gary and I agreed that tim_combining isn't biased, but are still butting
heads over whether it's actually just a trivial transformation of
Gary-combining <wink>; scores from each are always on the same *side* of
0.5, but tim-combining scores are always at least as far from 0.5 as
Gary-combining scores, and usually significantly farther -- that's why the
spread increases so dramatically).
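
For anyone who hasn't looked at the chi-squared code, here's a rough sketch
of the general idea (Fisher-style combining of the per-word spamprobs).
Treat it as an illustration under my own assumptions, not a copy of what
was checked in:  the names chi2Q and chi_combine, and the final
(S - H + 1) / 2 scaling, are placeholders for this sketch, and it assumes
every probability is strictly between 0 and 1.

```python
import math

def chi2Q(x2, v):
    """Probability that a chi-squared variate with v degrees of freedom
    is >= x2.  Only valid for even, positive v (closed-form series)."""
    assert v > 0 and v % 2 == 0
    m = x2 / 2.0
    total = term = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    # Roundoff can push the series a hair above 1.0; clamp it.
    return min(total, 1.0)

def chi_combine(probs):
    """Combine per-word spam probabilities into one score in [0, 1].

    Sketch only: assumes 0 < p < 1 for every p, and makes no attempt
    to guard against exp() underflow for very long messages.
    """
    n = len(probs)
    if n == 0:
        return 0.5
    ln_ham = sum(math.log(1.0 - p) for p in probs)
    ln_spam = sum(math.log(p) for p in probs)
    # S is near 1 when the evidence looks spammy, H when it looks hammy.
    S = 1.0 - chi2Q(-2.0 * ln_ham, 2 * n)
    H = 1.0 - chi2Q(-2.0 * ln_spam, 2 * n)
    return (S - H + 1.0) / 2.0
```

With a pile of uniformly strong clues, S or H saturates almost at once,
which is one plausible reason the scores pile up so hard at the extremes.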

Small test run, 10-fold CV with 400+400 in each set.  As usual when
switching combining schemes, the "won/lost" things don't make sense for the
"after" run, because the appropriate value for spam_cutoff changes.  The
before run is all-default, the after run just setting the new option true:

-> <stat> tested 400 hams & 400 spams against 3600 hams & 3600 spams
   [ditto 19 times]

false positive percentages
    0.000  0.000  tied
    0.000  0.250  lost  +(was 0)
    0.000  0.250  lost  +(was 0)
    0.000  0.000  tied
    0.250  0.500  lost  +100.00%
    0.000  0.250  lost  +(was 0)
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied

won   0 times
tied  6 times
lost  4 times

total unique fp went from 1 to 5 lost  +400.00%
mean fp % went from 0.025 to 0.125 lost  +400.00%

false negative percentages
    0.000  0.000  tied
    0.250  0.000  won   -100.00%
    0.000  0.000  tied
    0.250  0.250  tied
    0.250  0.000  won   -100.00%
    0.500  0.250  won    -50.00%
    0.000  0.000  tied
    0.250  0.000  won   -100.00%
    0.500  0.250  won    -50.00%
    0.000  0.000  tied

won   5 times
tied  5 times
lost  0 times

total unique fn went from 8 to 3 won    -62.50%
mean fn % went from 0.2 to 0.075 won    -62.50%

ham mean                     ham sdev
  27.29    0.49  -98.20%        5.80    3.68  -36.55%
  27.62    0.62  -97.76%        5.57    4.91  -11.85%
  27.25    0.66  -97.58%        5.52    5.40   -2.17%
  27.75    0.25  -99.10%        5.36    2.39  -55.41%
  27.47    0.84  -96.94%        6.07    6.78  +11.70%
  27.65    0.78  -97.18%        5.84    4.68  -19.86%
  28.00    0.75  -97.32%        5.85    4.41  -24.62%
  27.44    0.29  -98.94%        5.35    2.47  -53.83%
  27.55    0.36  -98.69%        5.31    2.66  -49.91%
  27.95    0.68  -97.57%        5.85    4.37  -25.30%

ham mean and sdev for all runs
  27.60    0.57  -97.93%        5.66    4.39  -22.44%

spam mean                    spam sdev
  82.89   99.96  +20.59%        7.17    0.48  -93.31%
  82.11   99.84  +21.59%        7.04    2.11  -70.03%
  81.34   99.93  +22.85%        7.30    0.79  -89.18%
  81.73   99.84  +22.16%        7.38    2.66  -63.96%
  82.07   99.85  +21.66%        6.78    1.85  -72.71%
  82.02   99.70  +21.56%        7.32    3.28  -55.19%
  82.03   99.91  +21.80%        7.05    1.27  -81.99%
  82.22   99.93  +21.54%        6.75    0.73  -89.19%
  82.14   99.70  +21.38%        7.50    3.27  -56.40%
  82.30   99.92  +21.41%        7.30    0.84  -88.49%

spam mean and sdev for all runs
  82.08   99.86  +21.66%        7.17    2.00  -72.11%

ham/spam mean difference: 54.48 99.29 +44.81
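
In case the columns above aren't obvious:  the percentages appear to be
plain relative changes from the "before" value to the "after" value, and
the last line is the gap between the spam and ham means in each run.  A
small sketch to check the arithmetic (pct_change is just a name made up
here, not the test driver's function):

```python
def pct_change(before, after):
    """Relative change from before to after, in percent.

    Returns None when before is 0, where the report prints "(was 0)"
    instead of a percentage.
    """
    if before == 0:
        return None
    return (after - before) / before * 100.0

# Overall ham mean, 27.60 -> 0.57
ham_mean_change = pct_change(27.60, 0.57)

# Gap between spam mean and ham mean, before and after
gap_before = 82.08 - 27.60
gap_after = 99.86 - 0.57
gap_gain = gap_after - gap_before
```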

Stare at what happened to the means, and it's easy to see that this is more
Graham-like in its score distribution than anything we've seen since using
Graham-combining:

-> <stat> Ham scores for all runs: 4000 items; mean 0.57; sdev 4.39
-> <stat> min -2.22045e-013; median 8.33096e-009; max 100

Check out the median there:  that's extreme.

Note that one ham scored 1.0 (100 on the scale used here)!  That's the
Nigerian-scam quote, and I don't care because it's hopeless.  It actually
scored 0.999999988294.

* = 63 items
 0.0 3813 *************************************************************
 0.5   32 *
 1.0   18 *
 1.5   13 *
 2.0    6 *
 2.5    5 *
 3.0    3 *
 3.5    4 *
 4.0    7 *
 4.5    7 *
 5.0    8 *
 5.5    2 *
 6.0    2 *
 6.5    3 *
 7.0    2 *
 7.5    3 *
 8.0    4 *
 8.5    0
 9.0    4 *
 9.5    2 *
10.0    2 *
10.5    0
11.0    2 *
11.5    1 *
12.0    1 *
12.5    1 *
13.0    2 *
13.5    1 *
14.0    1 *
14.5    1 *
15.0    1 *
15.5    2 *
16.0    1 *
16.5    3 *
17.0    1 *
17.5    1 *
18.0    3 *
18.5    0
19.0    1 *
19.5    0
20.0    1 *
20.5    1 *
21.0    0
21.5    1 *
22.0    0
22.5    0
23.0    1 *
23.5    0
24.0    0
24.5    0
25.0    0
25.5    2 *
26.0    2 *
26.5    0
27.0    1 *
27.5    0
28.0    0
28.5    1 *
29.0    1 *
29.5    2 *
30.0    0
30.5    0
31.0    1 *
31.5    0
32.0    0
32.5    0
33.0    0
33.5    0
34.0    1 *
34.5    0
35.0    0
35.5    0
36.0    1 *
36.5    3 *
37.0    2 *
37.5    0
38.0    0
38.5    0
39.0    0
39.5    2 *
40.0    0
40.5    1 *
41.0    1 *
41.5    0
42.0    0
42.5    0
43.0    0
43.5    0
44.0    0
44.5    1 *
45.0    0
45.5    1 *
46.0    0
46.5    0
47.0    0
47.5    1 *
48.0    0
48.5    0
49.0    2 *
49.5    0
50.0    0
50.5    0
51.0    0
51.5    1 *
52.0    0
52.5    0
53.0    0
53.5    0
54.0    0
54.5    1 *
55.0    1 *  haven't seen this get a high score since using bigrams;
55.5    0    it's someone putting together a Python user group;
56.0    0    "fully functional", etc -- accidental spam phrases
56.5    0
57.0    0
57.5    0
58.0    0
58.5    0
59.0    0
59.5    0
60.0    0
60.5    0
61.0    0
61.5    0
62.0    0
62.5    0
63.0    1 *  "If you are interested in saving money ...": someone looking
63.5    0    to share a hotel room at a Python conference, but neglecting
64.0    0    to mention it *is* a Python conference
64.5    0
65.0    0
65.5    0
66.0    0
66.5    0
67.0    0
67.5    0
68.0    0
68.5    0
69.0    0
69.5    0
70.0    0
70.5    1 *  this is a disturbing fp -- it's not spammish at all;
71.0    0    someone looking for help writing a webmasterish program;
71.5    0    lots of accidental high-spamprob words
72.0    0
72.5    0
73.0    0
73.5    0
74.0    0
74.5    0
75.0    0
75.5    0
76.0    0
76.5    1 *  "TOOLS Europe 2000" conference announcement
77.0    0
...
99.5    1 *  Nigerian-scam quote

-> <stat> Spam scores for all runs: 4000 items; mean 99.86; sdev 2.00
-> <stat> min 46.9565; median 100; max 100
* = 65 items

Note that the *median* is 100:  that's extreme.

...
46.5    1 *  "Hello, my Name is BlackIntrepid"
47.0    0
47.5    0
48.0    0
48.5    0
49.0    0
49.5    0
50.0    0
50.5    0
51.0    0
51.5    0
52.0    1 *  "Website Programmers Available Now!"; lots of tech terms
52.5    0
53.0    0
53.5    0
54.0    1 *  This one slays me.  It has this meta tag we ignore:
             <meta name="keywords" content"free stuff, get paid for being
              online, make money on the internet, computer jobs, home
              makers, get paid to surf, mlm, work at home,
              yes you can. for time spent surfing the internet,
              everything is free, no obligation, money, FREE, ...
             and on and on.  It also has this tag we ignore:
             <meta name="Classification" content="free money, mlm,
              paid to surf, home base business, home base businesses,
              free money, online">
             It may be the most obvious spam ever created <wink>.
54.5    0
55.0    0
55.5    0
56.0    0
56.5    0
57.0    0
57.5    0
58.0    0
58.5    0
59.0    1 *
59.5    0
60.0    2 *
60.5    0
61.0    0
61.5    0
62.0    0
62.5    0
63.0    0
63.5    0
64.0    0
64.5    0
65.0    0
65.5    0
66.0    0
66.5    0
67.0    0
67.5    0
68.0    0
68.5    0
69.0    1 *
69.5    0
70.0    0
70.5    0
71.0    0
71.5    0
72.0    0
72.5    0
73.0    0
73.5    0
74.0    0
74.5    0
75.0    1 *
            If spam_cutoff had been here, it would have matched the 8
            FN from the "before" run, and would have left only the
            Nigerian-scam and TOOLS announcement as f-p.
75.5    0
76.0    0
76.5    0
            And if spam_cutoff had been here, the wretched TOOLS
            announcement would have gotten thru too (sorry, but
            that announcement is spam in my eyes)
77.0    1 *
77.5    0
78.0    0
78.5    0
79.0    0
79.5    0
80.0    1 *
80.5    0
81.0    0
81.5    0
82.0    0
82.5    0
83.0    0
83.5    0
84.0    0
84.5    0
85.0    0
85.5    1 *
86.0    0
86.5    1 *
87.0    0
87.5    0
88.0    3 *
88.5    1 *
89.0    2 *
89.5    0
90.0    0
90.5    0
91.0    0
91.5    1 *
92.0    2 *
92.5    0
93.0    3 *
93.5    0
94.0    2 *
94.5    0
95.0    0
95.5    1 *
96.0    1 *
96.5    3 *
97.0    2 *
97.5    1 *
98.0    4 *
98.5    6 *
99.0    3 *
99.5 3953 *************************************************************

Looks promising, albeit uncomfortably extreme.  There's a huge and sparsely
populated middle ground where all the mistakes live, except for the hopeless
Nigerian scam quote.

Example:  if we called everything from 50 thru 80 "the middle ground", that
easily contains all but the Nigerian mistake, yet contains only 6 (of 4000
total) ham and only 8 (of 4000 total) spam.  So in a manual-review system,
this combines all the desirable properties:

1. Very little is kicked out for review.

2. There are high error rates among the msgs kicked out for review.

3. There are unmeasurably low error rates among the msgs not kicked
   out for review.
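
To make the manual-review idea concrete, here's a minimal sketch of a
three-way split on the 0-100 scores.  The function name triage and the
ham_cutoff parameter are hypothetical (only spam_cutoff exists as an option
today), and the 50/80 bounds are just the example figures from above.  The
sample list is the bucket lower bounds read off the ham histogram, standing
in for the real scores.

```python
def triage(score, ham_cutoff=50.0, spam_cutoff=80.0):
    """Classify a 0-100 score as 'ham', 'spam', or 'unsure'.

    Anything from ham_cutoff thru spam_cutoff (inclusive) is kicked
    out for manual review.
    """
    if score < ham_cutoff:
        return "ham"
    if score > spam_cutoff:
        return "spam"
    return "unsure"

# Ham histogram buckets at or above 50.0 (lower bounds, one msg each).
ham_buckets = [51.5, 54.5, 55.0, 63.0, 70.5, 76.5, 99.5]

review = [s for s in ham_buckets if triage(s) == "unsure"]
hopeless = [s for s in ham_buckets if triage(s) == "spam"]
```

Under these example cutoffs, review picks up the 6 middle-ground ham and
hopeless is left holding only the Nigerian-scam quote.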

Feel encouraged to try this if you like, but keep in mind that the *point*
here is how useful the middle ground may be -- just pasting in f-p and f-n
rates without analysis (== staring at the mistakes and thinking about them)
won't help (unless they're both disasters).  It may be wise to wait for Gary
to look over my previous questions about the math -- I can't swear the
implementation even makes sense at this point.