[Spambayes] RE: For the bold

Tim Peters tim.one@comcast.net
Sat, 05 Oct 2002 21:32:10 -0400


[Tim]
> ...
> It's actually unsure about almost 100% of the spam!  So this table's first
> row:
>
>                  RMS ham unsure    RMS spam unsure
>                  --------------    ---------------
> central_limit                 9                  0
> central_limit2              175                 77
> central_limit3              184                227
>
> should have said
>
> central_limit                 9               7495
>
> instead.  I assume this is evidence of a bug somewhere.  Note
> that the hmean and smean for a msg are always identical under the
> original central limit scheme.

The stuff below changes the first line to

  central_limit                49                  11

I believe "the bug" is in rmspik.chance(), which appears to assume that a
zscore in the positive direction is an indicator of certainty.  That seems
to be true in the logarithmic central-limit schemes, but isn't true in the
original central-limit scheme.  Changing the first three lines like so:

#    if x>=0:
#        return 1.0
#    x=-x/math.sqrt(2)
    x = abs(x)/math.sqrt(2)

and rerunning rmspik leads to very different results under the original
central limit scheme:

Reading clim.pik ...
Nham= 7500
RmsZham= 2.93763751621
Nspam= 7500
RmsZspam= 3.62374621717
======================================================================
HAM:
FALSE POSITIVE: zham=6.64 zspam=-1.66 Data/Ham/Set10/107687.txt SURE!
Sure/ok       7413
Unsure/ok     79
Unsure/not ok 7
Sure/not ok   1
Unsure rate = 1.15%
Sure fp rate = 0.01%; Unsure fp rate = 8.14%
======================================================================
SPAM:
Sure/ok       7451
Unsure/ok     38
Unsure/not ok 11
Sure/not ok   0
Unsure rate = 0.65%
Sure fn rate = 0.00%; Unsure fn rate = 22.45%
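
For reference, the percentages in that output can be reproduced from the four
counts in each section.  The sketch below is not code from rmspik; the
function name, the argument names, and the choice of denominators are
assumptions, picked because they match the printed figures (the "sure" error
rate would round to the same value if it were divided by the sure count
instead of the total).

```python
def rates(sure_ok, unsure_ok, unsure_bad, sure_bad):
    """Return (unsure rate, sure error rate, unsure error rate) in percent.

    Hypothetical helper: the denominators are inferred from the printed
    output, not taken from the rmspik source.
    """
    total = sure_ok + unsure_ok + unsure_bad + sure_bad
    unsure = unsure_ok + unsure_bad
    unsure_rate = 100.0 * unsure / total
    sure_error_rate = 100.0 * sure_bad / total
    # Guard against a run in which nothing was classified as unsure.
    unsure_error_rate = 100.0 * unsure_bad / unsure if unsure else 0.0
    return unsure_rate, sure_error_rate, unsure_error_rate


# Counts copied from the HAM and SPAM sections of the output.
ham = rates(7413, 79, 7, 1)
spam = rates(7451, 38, 11, 0)

print("ham:  unsure %.2f%%  sure fp %.2f%%  unsure fp %.2f%%" % ham)
print("spam: unsure %.2f%%  sure fn %.2f%%  unsure fn %.2f%%" % spam)
```

Both sections sum to 7500 messages, which agrees with the Nham and Nspam
lines.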

All the problems with spam went away then, and ham gives it more trouble
now.  It's still certain much more often here than under the extreme
central-limit schemes, so I still suspect RMS is a better fit to the
original cl scheme (but the probability calculation has to change to
something more symmetric).
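
To illustrate the difference between the one-sided test and the abs() version,
here is a minimal sketch.  Only the first three lines of rmspik.chance()
appear in this message, so the rest is an assumption: it supposes the function
goes on to return a normal tail probability via the complementary error
function.  The names chance_one_sided and chance_two_sided are made up for
the example, and math.erfc is the modern standard-library function, which may
not be what the original code used.

```python
import math


def chance_one_sided(z):
    """Assumed shape of the original: any z >= 0 is treated as a perfect fit."""
    if z >= 0:
        return 1.0
    x = -z / math.sqrt(2)
    return math.erfc(x)


def chance_two_sided(z):
    """Assumed shape after the change: large |z| in either direction is unlikely."""
    x = abs(z) / math.sqrt(2)
    return math.erfc(x)


# z-scores taken from the false positive reported in the output.
for z in (6.64, -1.66):
    print("z=%6.2f  one-sided=%.3g  two-sided=%.3g"
          % (z, chance_one_sided(z), chance_two_sided(z)))
```

Under these assumptions the two versions agree for negative z and differ only
for positive z, where the one-sided form always returns 1.0 however far the
score lies from the mean.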

The false positive it was certain about was the lady with a brief relevant
question, and a long, obnoxious, employer-generated sig.  That's one of my
two remaining f-p under the all-default scheme too (it so happens that the
Nigerian scam quote was in the training data on these runs, so can't show up
as an f-p).