[Spambayes] RE: For the bold
Tim Peters
tim.one@comcast.net
Sat, 05 Oct 2002 21:32:10 -0400
[Tim]
> ...
> It's actually unsure about almost 100% of the spam! So this table's first
> row:
>
>                  RMS ham unsure   RMS spam unsure
>                  --------------   ---------------
> central_limit                 9                 0
> central_limit2              175                77
> central_limit3              184               227
>
> should have said
>
> central_limit                 9              7495
>
> instead. I assume this is evidence of a bug somewhere. Note
> that the hmean and smean for a msg are always identical under the
> original central limit scheme.
The stuff below changes the first line to
central_limit 49 11
I believe "the bug" is in rmspik.chance(), which appears to assume that a
zscore in the positive direction is an indicator of certainty. That seems
to be true in the logarithmic central-limit schemes, but isn't true in the
original central-limit scheme. Changing the first three lines like so:
    # if x>=0:
    #     return 1.0
    # x=-x/math.sqrt(2)
    x = abs(x)/math.sqrt(2)
and rerunning rmspik leads to very different results under the original
central limit scheme:
Reading clim.pik ...
Nham= 7500
RmsZham= 2.93763751621
Nspam= 7500
RmsZspam= 3.62374621717
======================================================================
HAM:
FALSE POSITIVE: zham=6.64 zspam=-1.66 Data/Ham/Set10/107687.txt SURE!
Sure/ok 7413
Unsure/ok 79
Unsure/not ok 7
Sure/not ok 1
Unsure rate = 1.15%
Sure fp rate = 0.01%; Unsure fp rate = 8.14%
======================================================================
SPAM:
Sure/ok 7451
Unsure/ok 38
Unsure/not ok 11
Sure/not ok 0
Unsure rate = 0.65%
Sure fn rate = 0.00%; Unsure fn rate = 22.45%
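For reference, the percentages in the output above follow from the four counts in each section. A minimal sketch of that arithmetic: the function name `rates` and the choice of denominators are assumptions inferred from the printed numbers, not code taken from rmspik.

```python
def rates(sure_ok, unsure_ok, unsure_bad, sure_bad):
    """Return (unsure rate, sure error rate, unsure error rate) in percent.

    Hypothetical helper: the denominators are inferred from the printed
    output and are assumed to be nonzero.
    """
    total = sure_ok + unsure_ok + unsure_bad + sure_bad
    unsure = unsure_ok + unsure_bad
    sure = sure_ok + sure_bad
    return (100.0 * unsure / total,        # how often the scheme is unsure
            100.0 * sure_bad / sure,       # wrong among the "sure" calls
            100.0 * unsure_bad / unsure)   # wrong among the "unsure" calls

# Counts copied from the HAM and SPAM sections above.
ham = rates(7413, 79, 7, 1)
spam = rates(7451, 38, 11, 0)
print("ham:  %.2f%% %.2f%% %.2f%%" % ham)
print("spam: %.2f%% %.2f%% %.2f%%" % spam)
```

Rounded to two places, these reproduce the rates printed in both sections.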
All the problems with spam went away then, and ham gives it more trouble
now. It's still certain much more often here than under the extreme
central-limit schemes, so I still suspect RMS is a better fit to the
original cl scheme (but the probability calculation has to change to
something more symmetric).
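A minimal sketch of the difference between the one-sided and the symmetric treatment of the z-score. Only the first lines of rmspik.chance() are quoted in this message, so the rest is an assumption: both functions here are hypothetical stand-ins that use `math.erfc` for a normal tail probability, and the real function may compute something else.

```python
import math

def chance_one_sided(z):
    # Behaviour described above: any non-negative z-score short-circuits
    # to 1.0, so only deviations in the negative direction are measured.
    if z >= 0:
        return 1.0
    # Assumed body: normal tail probability of a deviation this large.
    return math.erfc(-z / math.sqrt(2))

def chance_symmetric(z):
    # After the change: deviations in either direction are treated alike.
    return math.erfc(abs(z) / math.sqrt(2))

# zham and zspam of the false positive reported above.
for z in (6.64, -1.66):
    print(z, chance_one_sided(z), chance_symmetric(z))
```

Under these assumptions the two versions agree for negative z-scores and differ only for positive ones, where the one-sided version always returns 1.0.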
The false positive it was certain about was the lady with a brief relevant
question, and a long, obnoxious, employer-generated sig. That's one of my
two remaining f-p under the all-default scheme too (it so happens that the
Nigerian scam quote was in the training data on these runs, so can't show up
as an f-p).