[Spambayes] Mixed combining

Tim Peters tim.one@comcast.net
Sat Oct 19 06:10:17 2002


[T. Alexander Popiel]
> I did two runs of the mixed combining.  Data is not yet indexed
> on my website; perhaps tomorrow.
>
> By my results, mixed spamprob is effectively neutral compared to
> straight chi-squared.  The best cost is better, but how to achieve
> those costs is no clearer than before.  The fp & fn counts are
> lower, but at a cost of about half again more unsures.  I guess
> it all depends on how you assign your costs.

I've run some more experiments of my own, and I'm embarrassed <wink> to
agree that indeed straight chi-squared did just as well, and that cutoffs
got fuzzier under mixed combining, and that Yet Another Parameter to fiddle
(the chi weight) was more Yet Another PITA (Parameter In The Ass) than
anything else.  Chalk it up to youthful enthusiasm -- I should follow my own
advice and just give up on my two miserable FPs.
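
For anyone following along, here is a minimal sketch of what the chi
weight does.  It assumes mixed combining is simply a weighted average of
the chi-squared score and a second combining score.  The names mixed_score
and classify are made up for illustration and are not the project's code.
The 0.9 weight and the 0.10/0.90 cutoffs come from the table header quoted
below; how a score exactly on a cutoff is treated is a guess.

```python
def mixed_score(chi_score, other_score, chi_weight=0.9):
    """Weighted average of two spam scores, each in [0.0, 1.0].

    Assumption: "mixed combining" blends the chi-squared score with a
    second score, and chi_weight is the fraction given to chi-squared.
    """
    return chi_weight * chi_score + (1.0 - chi_weight) * other_score


def classify(score, ham_cutoff=0.10, spam_cutoff=0.90):
    """Three-way decision: below ham_cutoff is ham, above spam_cutoff
    is spam, anything in between is unsure.  Boundary handling here is
    an assumption."""
    if score < ham_cutoff:
        return "ham"
    if score > spam_cutoff:
        return "spam"
    return "unsure"


# A message that chi-squared is nearly certain about, but that the
# second score disagrees with, gets pulled toward the middle.
blended = mixed_score(0.99, 0.05)
print(blended, classify(blended), classify(0.99))
```

In this made-up case the chi-squared score alone lands above the spam
cutoff, while the blend falls just under it and ends up unsure.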

> Anyway, here's the tables:
>
> Mixed, .9 chi-squared, 0.10-0.90 unsure:
> -> <stat> tested 50 hams & 200 spams against 450 hams & 1800 spams
> [...]
> -> <stat> tested 200 hams & 50 spams against 1800 hams & 450 spams
> ham:spam:   50-200  75-175 100-150 125-125 150-100  175-75  200-50
> fp total:        2       3       3       3       3       2       2
> fp %:         0.40    0.40    0.30    0.24    0.20    0.11    0.10
> fn total:        5       6       4       5       6       7       9
> fn %:         0.25    0.34    0.27    0.40    0.60    0.93    1.80
> unsure t:       46      44      45      42      52      51      52
> unsure %:     1.84    1.76    1.80    1.68    2.08    2.04    2.08
> real cost:  $34.20  $44.80  $43.00  $43.40  $46.40  $37.20  $39.40
> best cost:  $28.60  $28.20  $34.00  $33.20  $34.20  $30.40  $23.80
> h mean:       3.61    2.70    2.47    2.30    2.29    2.21    1.99
> h sdev:       8.09    6.15    6.13    5.93    6.13    5.84    4.79
> s mean:      97.08   96.69   96.33   95.84   94.94   94.34   92.25
> s sdev:       6.48    7.71    8.63   10.21   12.73   13.67   17.09
> mean diff:   93.47   93.99   93.86   93.54   92.65   92.13   90.26
> k:            6.42    6.78    6.36    5.80    4.91    4.72    4.13

This is a nice way to present summary info.  Are these produced by your
table2.py?  If so, I know where to find that -- would you consider
contributing it to the project?
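
For readers of the table, here is a rough sketch of how the derived rows
appear to follow from the raw ones.  This is not table2.py, which I
haven't reproduced here.  The formulas are inferred from the published
figures: the percentages match totals over ten test runs (so 500 hams and
2000 spams in the 50-200 column), "real cost" matches $10 per fp, $1 per
fn and $0.20 per unsure, and "k" matches mean diff divided by the sum of
the two sdevs.  The function name derived_rows and its parameters are
hypothetical.  "best cost" depends on searching over cutoffs and is not
computed.

```python
def derived_rows(n_ham, n_spam, fp, fn, unsure,
                 h_mean, h_sdev, s_mean, s_sdev,
                 fp_cost=10.0, fn_cost=1.0, unsure_cost=0.20):
    """Recompute the derived table rows for one ham:spam column.

    The cost weights and the formula for k are assumptions inferred
    from the table, not taken from the script that produced it.
    """
    mean_diff = s_mean - h_mean
    return {
        "fp %": round(100.0 * fp / n_ham, 2),
        "fn %": round(100.0 * fn / n_spam, 2),
        "unsure %": round(100.0 * unsure / (n_ham + n_spam), 2),
        "real cost": round(fp * fp_cost + fn * fn_cost
                           + unsure * unsure_cost, 2),
        "mean diff": round(mean_diff, 2),
        "k": round(mean_diff / (h_sdev + s_sdev), 2),
    }


# The 50-200 column: 10 runs of 50 hams & 200 spams each.
col = derived_rows(n_ham=500, n_spam=2000, fp=2, fn=5, unsure=46,
                   h_mean=3.61, h_sdev=8.09, s_mean=97.08, s_sdev=6.48)
for name, value in col.items():
    print(f"{name + ':':<12}{value:>8.2f}")
```

For the 50-200 column this gives 0.40, 0.25, 1.84, 34.20, 93.47 and 6.42,
the same as the table.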

> ...