[Spambayes] training on very small ham sets, normal sized spamsets.
Anthony Baxter
anthony@interlink.com.au
Wed Oct 30 07:36:46 2002
>>> "T. Alexander Popiel" wrote
> >So I hacked on timcv.py and msgs.py to add options 'spam-test',
> >'spam-train', 'ham-test' and 'ham-train', to allow you to set
> >the training set size separately to the testing set size.
> >I haven't checked this in because it will break everyone's
> >test scripts - --spam= will no longer be distinct, and getopt
> >will gripe. Let me know if I should check this in anyway - I
> >think it's useful, but YMMV.
> I'd like to have it. :-)
I figured out a backwards-compatible way to do it - name the new
options --SpamTrain, --SpamTest, &c. I'll check it in shortly.
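
For anyone wondering why a change of case is enough: getopt accepts any
unique prefix of a long option and matches case-sensitively, so new
lower-case options can make an abbreviation that existing scripts rely on
ambiguous. A minimal sketch, assuming the scripts abbreviate a hypothetical
existing option spam-keep= as --spam= (that option name and the values are
illustrative assumptions, not taken from timcv.py):

```python
import getopt


def parse(argv, longopts):
    """Return {option: value}, or getopt's error message as a string."""
    try:
        opts, _args = getopt.getopt(argv, "", longopts)
    except getopt.GetoptError as err:
        return str(err)
    return dict(opts)


# Hypothetical pre-existing options that scripts abbreviate to --spam= / --ham=.
old = ["spam-keep=", "ham-keep="]

# Lower-case additions share the "spam"/"ham" prefix, so --spam= is ambiguous.
clashing = old + ["spam-train=", "spam-test=", "ham-train=", "ham-test="]

# Mixed-case additions do not start with "spam", so --spam= stays unique.
compatible = old + ["SpamTrain=", "SpamTest=", "HamTrain=", "HamTest="]

result_old = parse(["--spam=200"], old)
result_clash = parse(["--spam=200"], clashing)
result_compat = parse(["--spam=200", "--SpamTrain=350"], compatible)

print(result_old)     # abbreviation expands to the full option name
print(result_clash)   # getopt's complaint about a non-unique prefix
print(result_compat)  # old abbreviation and new option coexist
```

With the mixed-case names the old abbreviation still resolves, which is
all that "backwards compatible" needs to mean here.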
> Cool. Good to see someone more thorough than I am... I've
> been getting(?) sloppy. I'm not a real statistician, and
> it shows.
Neither am I - I just know enough to hurt myself :)
> >Here's the summary-summary table:
> >ham-train  bestcost  realcost    fp%   fn%  unsure%
> >        1    430.80  11498.75  56.70  0.00    26.46
> >       10    274.05   3345.10  15.76  0.03    32.06
> >       20    245.50   1855.80   8.61  0.03    22.18
> >       30    242.15   1642.90   7.64  0.00    19.23
> >       40    234.40   1154.45   5.31  0.00    15.33
> >       60    225.55    725.65   3.35  0.03     9.23
> >      100    221.05    532.40   2.46  0.03     6.61
> >      150    218.60    410.30   1.91  0.08     4.51
> >      200    179.90    199.45   0.88  0.10     3.91
> >      250    130.05    138.05   0.58  0.08     3.72
> >      300     96.80    104.25   0.41  0.15     3.38
> >      350     66.75     73.45   0.26  0.17     3.20
> >      400     63.25     69.65   0.25  0.20     2.94
> >      450     61.95     61.95   0.21  0.28     2.78
> >      500     52.50     58.05   0.20  0.23     2.63
> >      600     44.15     50.00   0.16  0.23     2.54
> >      700     37.75     41.60   0.12  0.28     2.31
> >     1000     26.20     27.80   0.06  0.28     2.09
> >     1500     19.60     24.40   0.03  0.45     2.48
> >     2000     15.50     20.70   0.00  0.45     2.70
> >     2500     15.60     21.90   0.00  0.43     2.94
> >     2700     20.60     22.80   0.00  0.50     2.97
> >
> >It seems like most of the wins come once you get up around 350, the
> >number of spam trained on. The unsure bucket actually gets a bit worse
> >as more ham is added - looking at the histograms, various bits of spam
> >are dragged downwards.
>
> Beautiful. It looks like the excess ham only starts hurting
> unsures after about 1000 (or about 3:1).
fns also get worse after about 2:1, and most of the fp wins are
there by the time you get to 3:1. So from this I'd say something
like 2:1 or 3:1 ham:spam is a good ratio. But, as always, YMMV.
The 'bestcost' column shows something different, but it weights
fps too heavily against everything else (for my tastes). (Yes, I can
tweak it, but chose not to for this test.)
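
To make the ratio argument concrete, here is a rough sketch that takes a
few rows of the summary table, converts ham-train into a ham:spam ratio,
and scores each row with a weighted sum of the three percentages. The
assumptions: 350 spam trained (the "around 350" quoted above), weights
of 10 / 1 / 0.2 for fp / fn / unsure, and a score computed on percentages.
So it is only a proxy, not the bestcost or realcost columns themselves.

```python
# (ham_train, fp%, fn%, unsure%) copied from a subset of the summary table.
ROWS = [
    (350, 0.26, 0.17, 3.20),
    (700, 0.12, 0.28, 2.31),
    (1000, 0.06, 0.28, 2.09),
    (1500, 0.03, 0.45, 2.48),
    (2000, 0.00, 0.45, 2.70),
    (2700, 0.00, 0.50, 2.97),
]

SPAM_TRAINED = 350  # assumption: "around 350" spam in the training set


def ratio(ham_train, spam_train=SPAM_TRAINED):
    """Ham:spam training ratio, expressed as N in N:1."""
    return ham_train / spam_train


def score(row, fp_weight, fn_weight=1.0, unsure_weight=0.2):
    """Weighted sum of the percentages (assumed weights; lower is better)."""
    _ham, fp, fn, unsure = row
    return fp_weight * fp + fn_weight * fn + unsure_weight * unsure


def best_row(fp_weight):
    """Row with the lowest proxy score for a given fp weight."""
    return min(ROWS, key=lambda row: score(row, fp_weight))


for fp_weight in (10.0, 1.0):
    ham = best_row(fp_weight)[0]
    print(f"fp weight {fp_weight:4.1f}: best ham-train {ham} "
          f"(about {ratio(ham):.1f}:1)")
```

Under these assumed weights, a heavy fp weight picks the 2000-ham row,
while weighting fps the same as fns picks the 1000-ham row, i.e. just
under 3:1.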
--
Anthony Baxter <anthony@interlink.com.au>
It's never too late to have a happy childhood.