[Spambayes] training on very small ham sets, normal sized spamsets.

Anthony Baxter anthony@interlink.com.au
Wed Oct 30 07:36:46 2002


>>> "T. Alexander Popiel" wrote
> >So I hacked on timcv.py and msgs.py to add options 'spam-test', 
> >'spam-train', 'ham-test' and 'ham-train', to allow you to set 
> >the training set size separately to the testing set size.
> >I haven't checked this in because it will break everyone's 
> >test scripts - --spam= will no longer be distinct, and getopt
> >will gripe. Let me know if I should check this in anyway - I 
> >think it's useful, but YMMV. 
> I'd like to have it. :-)

I figured out a backwards-compatible way to do it - name the new
options --SpamTrain, --SpamTest, etc. I'll check it in shortly.
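
A minimal sketch of why the mixed-case names avoid the clash, using
Python's standard getopt module. The existing option name spam-keep is
only a stand-in (an assumption for illustration, not taken from
timcv.py), as are the helper names parse and is_rejected. The point is
that getopt accepts any unique prefix of a long option and compares
names case-sensitively.

```python
import getopt

# Hypothetical stand-ins for options that test scripts already abbreviate.
EXISTING = ["spam-keep=", "ham-keep="]

# The hyphenated names first proposed, and the mixed-case replacements.
CLASHING = EXISTING + ["spam-test=", "spam-train=", "ham-test=", "ham-train="]
COMPATIBLE = EXISTING + ["SpamTest=", "SpamTrain=", "HamTest=", "HamTrain="]


def parse(argv, longopts):
    """Parse long options only and return them as a dict."""
    opts, _args = getopt.getopt(argv, "", longopts)
    return dict(opts)


def is_rejected(argv, longopts):
    """Return True if getopt raises an error for argv with these options."""
    try:
        parse(argv, longopts)
    except getopt.GetoptError:
        return True
    return False


if __name__ == "__main__":
    # --spam is a unique prefix of --spam-keep, so it is expanded.
    print(parse(["--spam=300"], EXISTING))
    # With spam-test and spam-train added, the prefix is no longer unique.
    print(is_rejected(["--spam=300"], CLASHING))
    # Mixed-case names do not start with "spam", so old scripts keep working.
    print(parse(["--spam=300", "--SpamTrain=350"], COMPATIBLE))
```

Old scripts that abbreviate to --spam= keep working because no new
name starts with lowercase "spam" or "ham".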

> Cool.  Good to see someone more thorough than I am... I've
> been getting(?) sloppy.  I'm not a real statistician, and
> it shows.

Neither am I - I just know enough to hurt myself :)

> >Here's the summary-summary table:
> >ham-train  bestcost  realcost    fp%   fn% unsure%
> >        1    430.80  11498.75  56.70  0.00   26.46
> >       10    274.05   3345.10  15.76  0.03   32.06
> >       20    245.50   1855.80   8.61  0.03   22.18
> >       30    242.15   1642.90   7.64  0.00   19.23
> >       40    234.40   1154.45   5.31  0.00   15.33
> >       60    225.55    725.65   3.35  0.03    9.23
> >      100    221.05    532.40   2.46  0.03    6.61
> >      150    218.60    410.30   1.91  0.08    4.51
> >      200    179.90    199.45   0.88  0.10    3.91
> >      250    130.05    138.05   0.58  0.08    3.72
> >      300     96.80    104.25   0.41  0.15    3.38
> >      350     66.75     73.45   0.26  0.17    3.20
> >      400     63.25     69.65   0.25  0.20    2.94
> >      450     61.95     61.95   0.21  0.28    2.78
> >      500     52.50     58.05   0.20  0.23    2.63
> >      600     44.15     50.00   0.16  0.23    2.54
> >      700     37.75     41.60   0.12  0.28    2.31
> >     1000     26.20     27.80   0.06  0.28    2.09
> >     1500     19.60     24.40   0.03  0.45    2.48
> >     2000     15.50     20.70   0.00  0.45    2.70
> >     2500     15.60     21.90   0.00  0.43    2.94
> >     2700     20.60     22.80   0.00  0.50    2.97
> >
> >It seems like most of the wins come once you get up around 350, the
> >number of spam trained on. The unsure bucket actually gets a bit worse
> >as more ham is added - looking at the histograms, various bits of spam
> >are dragged downwards.
> 
> Beautiful.  It looks like the excess ham only starts hurting
> unsures after about 1000 (or about 3:1).

The fn rate also gets worse after about 2:1, and most of the fp wins
are in by the time you get to 3:1. So from this I'd say something
like 2:1 or 3:1 ham:spam is a good ratio. But, as always, YMMV.
The 'bestcost' column shows something different, but it weights fps
too heavily against everything else (for my tastes). (Yes, I can
tweak it, but chose not to for this test.)
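
A rough sketch of the ratio arithmetic, using the percentage columns
from the table above and the 350 spam trained on. The score function
and its weights are assumptions for illustration only: it is a weighted
sum of the fp%, fn% and unsure% columns, not the calculation behind the
bestcost or realcost columns.

```python
SPAM_TRAIN = 350  # number of spam trained on, per the quoted message

# (ham-train, fp%, fn%, unsure%) copied from the summary-summary table.
ROWS = [
    (1, 56.70, 0.00, 26.46), (10, 15.76, 0.03, 32.06),
    (20, 8.61, 0.03, 22.18), (30, 7.64, 0.00, 19.23),
    (40, 5.31, 0.00, 15.33), (60, 3.35, 0.03, 9.23),
    (100, 2.46, 0.03, 6.61), (150, 1.91, 0.08, 4.51),
    (200, 0.88, 0.10, 3.91), (250, 0.58, 0.08, 3.72),
    (300, 0.41, 0.15, 3.38), (350, 0.26, 0.17, 3.20),
    (400, 0.25, 0.20, 2.94), (450, 0.21, 0.28, 2.78),
    (500, 0.20, 0.23, 2.63), (600, 0.16, 0.23, 2.54),
    (700, 0.12, 0.28, 2.31), (1000, 0.06, 0.28, 2.09),
    (1500, 0.03, 0.45, 2.48), (2000, 0.00, 0.45, 2.70),
    (2500, 0.00, 0.43, 2.94), (2700, 0.00, 0.50, 2.97),
]


def ratio(ham_train):
    """ham:spam training ratio for a given ham-train size."""
    return ham_train / SPAM_TRAIN


def score(row, fp_w, fn_w, unsure_w):
    """Hypothetical weighted sum of the three percentage columns."""
    _ham, fp, fn, unsure = row
    return fp_w * fp + fn_w * fn + unsure_w * unsure


def best_ham_train(fp_w, fn_w, unsure_w):
    """ham-train size with the lowest score under the given weights."""
    return min(ROWS, key=lambda r: score(r, fp_w, fn_w, unsure_w))[0]


if __name__ == "__main__":
    for fp_w in (10.0, 1.0):  # heavy vs. equal fp weighting (both assumed)
        ham = best_ham_train(fp_w, 1.0, 0.2)
        print(f"fp weight {fp_w}: ham-train {ham}, ratio {ratio(ham):.2f}:1")
```

With these made-up weights, weighting fp ten times as heavily as fn
picks ham-train 2000 (about 5.7:1), while equal fp and fn weights pick
ham-train 1000 (about 2.9:1).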



-- 
Anthony Baxter     <anthony@interlink.com.au>   
It's never too late to have a happy childhood.