[Spambayes] Better optimization loop

Tim Peters tim_one@email.msn.com
Wed Nov 20 05:01:51 2002


[Tim]
>> Good observation!  That should help.  simplex isn't fast in the best of
>> cases, and in this case ...

[Rob Hooft]
> Anyone that has a faster optimization algorithm lying around is welcome
> to replace my Simplex code.

Twasn't a criticism, just an observation about downhill Simplex, in anyone's
implementation.  Multidimensional optimization is a darned hard problem, and
this approach is at least pretty robust.
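
If anyone does want to try an off-the-shelf replacement, SciPy's fmin is
the same Nelder-Mead downhill simplex, so swapping it in is mostly a matter
of wrapping a cv run as a cost function of the tunable options.  A minimal
sketch (the quadratic objective is a made-up stand-in for a real flexcost
evaluation, not anything in the codebase):

    # Sketch: scipy.optimize.fmin implements Nelder-Mead downhill simplex.
    from scipy.optimize import fmin

    def cost(params):
        # Stand-in objective; the real thing would run a cv pass with
        # these option values and return the resulting flexcost.
        x, s = params
        return (x - 0.5) ** 2 + (s - 0.45) ** 2

    start = [0.5, 0.45]   # e.g. unknown_word_prob, unknown_word_strength
    best = fmin(cost, start, xtol=1e-4, ftol=1e-4)
    print(best)           # converges to approximately [0.5, 0.45]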

>>> To that goal I added a "delayed" flexcost to the CostCounter module
>>> that can use the optimal cutoffs calculated at the end of timcv.py.

>> Those can be pretty extreme; e.g., I've seen it suggest ham_cutoff of
>> 0.99 and spam_cutoff of 0.995 to get rid of "impossible" FP.

> They are in any case better than any other alternative I could think of.
> But if you disagree, you can change the order in which the
> CostCounter.default() builds up the cost counters; the optimization
> always uses the last one.

I don't disagree.  The point was that the "optimal cutoffs" are *also*
working like mad to accommodate outliers at the expense of everything else.
So long as FP are viewed as an approximation to the end of the world, all
attempts to optimize settings are going to focus on them.
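
The cutoff search is easy to sketch, and the sketch shows why it goes to
extremes:  with the usual $10/$1/$0.20 weights for FP/FN/Unsure (the
defaults hereabouts), one FP outweighs 50 unsure hams, so dragging both
cutoffs toward 1.0 to rescue a single "impossible" FP is a bargain.  The
names here (flexcost, best_cutoffs) and the grid step are mine for
illustration, not the CostCounter API:

    # 'scores' is a hypothetical list of (score, is_spam) pairs from a
    # cross-validation run.
    def flexcost(scores, ham_cutoff, spam_cutoff,
                 fp_cost=10.0, fn_cost=1.0, unsure_cost=0.2):
        total = 0.0
        for score, is_spam in scores:
            if score >= spam_cutoff:      # called spam
                if not is_spam:
                    total += fp_cost      # false positive
            elif score >= ham_cutoff:     # called unsure
                total += unsure_cost
            else:                         # called ham
                if is_spam:
                    total += fn_cost      # false negative
        return total

    def best_cutoffs(scores, step=0.005):
        # Brute-force scan of (ham_cutoff, spam_cutoff) pairs, keeping
        # the pair with the smallest total cost.
        grid = [i * step for i in range(int(1 / step) + 1)]
        best = None
        for ham in grid:
            for spam in grid:
                if spam < ham:
                    continue
                c = flexcost(scores, ham, spam)
                if best is None or c < best[0]:
                    best = (c, ham, spam)
        return best   # (cost, ham_cutoff, spam_cutoff)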

> ...
> Very similar to my case. I'm seriously thinking about removing the
> "hopeless" and "almost hopeless" messages from my corpora. I agree with
> the Bayesian statistics that they can't be correctly classified.

Whether it's a good idea to remove them depends on the goal <wink>.  I keep
mine in my test data so that the error rates reflect real life.  But there
are about 10 ham in my c.l.py data I simply don't care about, and it doesn't
bother me a bit if they pop back into my FP set (indeed, the last few rounds
of changes boosted my c.l.py total from 1 FP to 3 FP -- BFD!  FP Happen, and
the last few rounds of changes had helpful effects on almost everything
else).  In that sense, it's wholly unrealistic (but perhaps pragmatically
necessary) to say that each FP (and FN, and Unsure) has exactly the same
cost as every other.  Some FP simply don't matter, while others matter a
lot.  Likewise, I find some kinds of spam much more irritating than others,
and although my c.l.py data has no FN remaining, there are about 50 spam
there I really enjoy, so I'd like to penalize the system for not letting me
see them <wink>.

> ...
> Press et al. report about a "robust fit", which is not a least squares
> but a least absolute deviates fit. It is insensitive to outliers.
> Is there an analog idea for us?

I don't know, but am not sanguine:  there's a specific cost function we're
trying to minimize, and though it's unrealistic, it's better than nothing.
Introducing this cost measure was a real help!  Trying to squeeze
the last penny out of it probably isn't, though -- it's not that good a
model of reality.  It does *generally* help us by saying FP are worse than
FN are worse than Unsure, and attaching a concrete figure of merit to that
aggregate judgment, but I don't take that number as more than an indicator
where "a lot smaller is better".  Small changes in it don't bother or cheer
me.
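
The one-line version of the Press et al. idea:  fitting a constant by least
squares gives the mean, fitting by least absolute deviates gives the
median, and an outlier drags the mean hard while barely budging the median.
A robust objective would analogously stop one doomed FP from dominating the
tuning, but our objective is a cost model rather than a fit, so I don't see
how to carry it over directly:

    # L2 vs L1 when "fitting" a single constant to data with an outlier.
    data = [0.1, 0.2, 0.2, 0.3, 9.9]       # one wild outlier

    mean = sum(data) / len(data)           # least-squares answer: 2.14
    median = sorted(data)[len(data) // 2]  # least-abs-dev answer: 0.2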

> ...
> Further results I obtained: My idea of running with an fp cost of $2 and
> a square cost function didn't work. It doesn't optimize to a consistent
> position. Increasing the cost of an fp back to $10 and running with the
> same square function did do a reasonable job; it optimized to:
>
> [Classifier]
> unknown_word_prob = 0.520415
> minimum_prob_strength = 0.315104
> unknown_word_strength = 0.215393
>
> So the unknown_word_prob is now back to 0.5 again!

What's more, I bet 0.52 is closer to the true unknown-word probability in
your data (take all the words that have appeared at least, say, 5 times,
and average their spamprobs; that's about the best guess we can make for
the spamprob of a word we see for the first time; in the three corpora I
measured this on, 0.52 was the smallest empirical value I saw).  The other
two settings act to look only at very extreme words, and to keep words
extreme longer in the face of contrary evidence (a hapax is strong enough
to survive a minimum_prob_strength of 0.3 even with s at the default 0.45;
hapaxes are even more extreme at s=0.22).  Our guess at "the true spamprob"
may still have room for improvement.  OTOH,
if you have more ham than spam, then x=0.52 is acting to make things "less
hammy", and a benefit may come from that.  In that case, enabling the new
ham/spam imbalance adjustment option may help even more.
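
Here's that measurement spelled out, assuming a trained classifier whose
wordinfo maps words to records with spamcount, hamcount and spamprob
attributes (as the Bayes class keeps them); the function name is mine:

    def estimate_unknown_word_prob(bayes, min_count=5):
        # Average the spamprobs of all words seen at least min_count
        # times; that average is the empirical best guess for the
        # spamprob of a never-before-seen word.
        probs = [info.spamprob
                 for info in bayes.wordinfo.values()
                 if info.spamcount + info.hamcount >= min_count]
        return sum(probs) / len(probs)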



