[Spambayes] Better optimization loop
Rob Hooft
rob@hooft.net
Tue Nov 19 17:47:02 2002
Tim Peters wrote:
> [Rob Hooft, simplifying simplex]
>
>>...
>>I decided that we have a perfect way to optimize the ham and spam
>>cutoff values in timcv already, so that I can remove these from the
>>simplex optimization.
>
>
> Good observation! That should help. simplex isn't fast in the best of
> cases, and in this case ...
Anyone who has a faster optimization algorithm lying around is welcome
to replace my Simplex code.
>>To that goal I added a "delayed" flexcost to the CostCounter module
>>that can use the optimal cutoffs calculated at the end of timcv.py.
>
> Those can be pretty extreme; e.g., I've seen it suggest ham_cutoff of 0.99
> and spam_cutoff of 0.995 to get rid of "impossible" FP.
They are in any case better than any alternative I could think of.
But if you disagree, you can change the order in which
CostCounter.default() builds up the cost counters; the optimization
always uses the last one.
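For concreteness, here is a minimal sketch of the "delayed" idea, using
the fixed $10/$1/$0.20 best-cost weights rather than the full flexcost;
the class is illustrative and not the actual CostCounter API:

    class DelayedCostCounter:
        """Record scores as they arrive; compute the cost only once the
        optimal cutoffs are known at the end of the run."""
        def __init__(self, fp_weight=10.0, fn_weight=1.0, unsure_weight=0.2):
            self.fp_weight = fp_weight
            self.fn_weight = fn_weight
            self.unsure_weight = unsure_weight
            self.scores = []            # (score, is_spam) pairs

        def add(self, score, is_spam):
            self.scores.append((score, is_spam))

        def cost(self, ham_cutoff, spam_cutoff):
            total = 0.0
            for score, is_spam in self.scores:
                if ham_cutoff <= score < spam_cutoff:
                    total += self.unsure_weight    # unsure either way
                elif is_spam and score < ham_cutoff:
                    total += self.fn_weight        # spam rated as ham
                elif not is_spam and score >= spam_cutoff:
                    total += self.fp_weight        # ham rated as spam
            return total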
> It did a little better here too. The best-cost analyses show that it's also
> nuking FP at the expense of unsures:
>
> base:
>
> -> best cost for all runs: $12.80
> -> achieved at 2 cutoff pairs
> -> smallest ham & spam cutoffs 0.52 & 0.95
> -> fp 1; fn 1; unsure ham 2; unsure spam 7
> -> fp rate 0.0167%; fn rate 0.0167%; unsure rate 0.075%
> -> largest ham & spam cutoffs 0.525 & 0.95
> -> fp 1; fn 1; unsure ham 2; unsure spam 7
> -> fp rate 0.0167%; fn rate 0.0167%; unsure rate 0.075%
>
> simp:
>
> -> best cost for all runs: $11.80
> -> achieved at ham & spam cutoffs 0.495 & 0.995
> -> fp 0; fn 0; unsure ham 10; unsure spam 49
> -> fp rate 0%; fn rate 0%; unsure rate 0.492%
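(As a sanity check, both totals follow directly from the usual weights,
assuming the $10 per fp, $1 per fn, $0.20 per unsure defaults:

    def best_cost(fp, fn, unsure):
        return 10.0 * fp + 1.0 * fn + 0.2 * unsure

    print(best_cost(1, 1, 2 + 7))    # base: 12.8
    print(best_cost(0, 0, 10 + 49))  # simp: 11.8

so the simplex really is trading one $10 fp and one $1 fn for a pile of
$0.20 unsures.)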
Very similar to my case. I'm seriously thinking about removing the
"hopeless" and "almost hopeless" messages from my corpora. I agree with
the Bayesian statistics that they can't be correctly classified.
> Ya, I reported that from a paper wrestling with boosting, but it's a common
> observation. Even in simple settings! Say you're doing a least-squares
> linear regression on this data:
>
> x f(x)
> - ----
> 1 1.9
> 2 4.1
> 3 5.9
> 4 -10.0
> 5 10.1
> 6 12.1
> 7 13.8
>
> If you throw out (4, -10), you get an excellent fit to everything that
> remains. If you leave it in, you still get "an answer", but it's not a good
> fit to anything.
Press et al. describe a "robust fit", which minimizes absolute
deviations rather than squared deviations and is therefore insensitive
to outliers. Is there an analogous idea for us?
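To make the contrast concrete, here is a small self-contained sketch
fitting y = a*x + b to Tim's data under both loss functions; the
brute-force grid search is purely illustrative:

    xs = [1, 2, 3, 4, 5, 6, 7]
    ys = [1.9, 4.1, 5.9, -10.0, 10.1, 12.1, 13.8]

    def fit(loss):
        # exhaustive search over a grid of (a, b) pairs, step 0.1
        best = None
        for a in [i / 10.0 for i in range(-50, 51)]:
            for b in [i / 10.0 for i in range(-100, 101)]:
                cost = sum(loss(y - (a * x + b)) for x, y in zip(xs, ys))
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        return best[1:]

    print(fit(lambda r: r * r))   # least squares: dragged by the outlier
    print(fit(lambda r: abs(r)))  # least absolute deviations: ignores it

On this data least squares comes out roughly y = 2.0x - 2.7, which
misses all the good points by a couple of units, while the
absolute-deviations fit comes out near y = 2x, essentially ignoring
(4, -10).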
> When I try a new thing, I usually start with several runs but on *much* less
> data per run. If at least 3 of 5 show the effect I was hoping for, I may
> push on; but if 3 of 5 don't, I either give up on it, or change the rules to
> 4 of 7 (if I'm really in love with the idea <wink>).
These optimizations are very sensitive to step functions: the cost only
changes when a message crosses a cutoff, so with too few messages the
surface is all flat plateaus. I therefore need lots of data to run
them; with a small data set the simplex will stop wherever you start it.
Further results I obtained: my idea of running with an fp cost of $2 and
a square cost function didn't work; it doesn't optimize to a consistent
position. Increasing the cost of an fp back to $10 and running with the
same square function did do a reasonable job; it optimized to:
[Classifier]
unknown_word_prob = 0.520415
minimum_prob_strength = 0.315104
unknown_word_strength = 0.215393
So unknown_word_prob is now back to 0.5 again! I just committed my
changes to the optimization code; any hints on improvements are welcome.
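For anyone who wants to tinker, the shape of such an optimization is
roughly the following. Note that evaluate() here is a hypothetical
stand-in (a smooth bowl around the values above) for a full timcv run
returning the delayed flexcost, and the SciPy Nelder-Mead call is
illustrative, not what the committed code uses:

    from scipy.optimize import minimize

    def evaluate(params):
        # Hypothetical stand-in for a full cross-validation run: a
        # smooth bowl whose minimum sits at the values reported above.
        # The real cost is the delayed flexcost at the end of timcv.py.
        target = (0.520415, 0.315104, 0.215393)
        return sum((p - t) ** 2 for p, t in zip(params, target))

    start = [0.5, 0.3, 0.2]  # rough guesses for unknown_word_prob,
                             # minimum_prob_strength, unknown_word_strength
    result = minimize(evaluate, start, method="Nelder-Mead")
    print(result.x)          # converges toward the values above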
Rob
--
Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/