[Spambayes] RE: For the bold

Tim Peters tim.one@comcast.net
Sat, 05 Oct 2002 20:46:32 -0400


Oops!  I misread this data badly.

> Crunching the raw data via rmspik [from the original use_central_limit]:
>
> Reading clim.pik ...
> Nham= 7500
> RmsZham= 2.93763751621
> Nspam= 7500
> RmsZspam= 3.62374621717
> ======================================================================
> HAM:
> Sure/ok       7491
> Unsure/ok     8
> Unsure/not ok 1
> Sure/not ok   0
> Unsure rate = 0.12%
> Sure fp rate = 0.00%; Unsure fp rate = 11.11%
> ======================================================================
> SPAM:
> FALSE NEGATIVE: zham=4.22 zspam=-4.08 Data/Spam/Set4/3434.txt SURE!
> FALSE NEGATIVE: zham=4.55 zspam=-3.75 Data/Spam/Set4/635.txt SURE!
> FALSE NEGATIVE: zham=4.90 zspam=-3.41 Data/Spam/Set6/12822.txt SURE!
> FALSE NEGATIVE: zham=3.18 zspam=-5.12 Data/Spam/Set7/4234.txt SURE!
> FALSE NEGATIVE: zham=4.85 zspam=-3.45 Data/Spam/Set8/975.txt SURE!
> Sure/ok       0
> Unsure/ok     0
> Unsure/not ok 7495
> Sure/not ok   5
> Unsure rate = 99.93%
> Sure fn rate = 100.00%; Unsure fn rate = 100.00%

It's actually unsure about almost 100% of the spam!  So this table's first
row:

>                  RMS ham unsure    RMS spam unsure
>                  --------------    ---------------
> central_limit                 9                  0
> central_limit2              175                 77
> central_limit3              184                227

should have said

> central_limit                 9               7495

instead.  I assume this is evidence of a bug somewhere.  Note that the hmean
and smean for a msg are always identical under the original central limit
scheme.
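
As a sanity check, the percentages in the quoted rmspik output can be
reproduced from the four counts printed for each class.  The sketch below is
only an assumption about how those rates are defined, inferred from the
numbers themselves; it is not rmspik's code, and the function name `rates`
is made up for illustration.

```python
def rates(sure_ok, unsure_ok, unsure_not_ok, sure_not_ok):
    """Return (unsure rate, sure error rate, unsure error rate) in percent.

    Assumed definitions, inferred from the quoted output:
      unsure rate       = unsure msgs / all msgs
      sure error rate   = sure-but-wrong msgs / sure msgs
      unsure error rate = unsure-but-wrong msgs / unsure msgs
    """
    total = sure_ok + unsure_ok + unsure_not_ok + sure_not_ok
    n_sure = sure_ok + sure_not_ok
    n_unsure = unsure_ok + unsure_not_ok

    def pct(num, den):
        # Guard against an empty category.
        return 100.0 * num / den if den else 0.0

    return (pct(n_unsure, total),
            pct(sure_not_ok, n_sure),
            pct(unsure_not_ok, n_unsure))


# Counts copied from the quoted HAM and SPAM sections.
ham = rates(sure_ok=7491, unsure_ok=8, unsure_not_ok=1, sure_not_ok=0)
spam = rates(sure_ok=0, unsure_ok=0, unsure_not_ok=7495, sure_not_ok=5)

print("ham:  unsure %.2f%%  sure fp %.2f%%  unsure fp %.2f%%" % ham)
print("spam: unsure %.2f%%  sure fn %.2f%%  unsure fn %.2f%%" % spam)
```

Under these assumed definitions the spam unsure count is 7495 of 7500, which
is the figure the corrected table row reports.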