[spambayes-dev] train to exhaustion?

Seth Goodman sethg at GoodmanAssociates.com
Fri Feb 13 13:07:42 EST 2004


> [Kenny Pitt]
> The total number of misses did not decrease between rounds 3 and 4, but
> further rounds did reduce the misses to zero.

This is undoubtedly more than you want to read, but here goes.  Kenny's
observation is expected.  What you've implemented, in signal processing
terms, is an adaptive estimator for both the order (number of parameters)
and the values of a parameter set.  You know the desired results with no
noise (all correct classifications are known a priori), and you are trying
to come up with both the number of parameters and the values that describe
the data "best" according to some scalar cost function.
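
To make that concrete, here's roughly the loop I understand you to be
running.  This is only a sketch:  classify/train and the (message, is_spam)
pairs are stand-ins for whatever the real interface is, not actual
SpamBayes calls.

    def train_to_exhaustion(classifier, messages, max_rounds=20):
        # 'messages' is a list of (msg, is_spam) pairs whose correct
        # labels are known a priori -- the "no noise" case.  Each round
        # retrains on every miss, which changes both which tokens carry
        # parameters (the order) and what those parameters are (the
        # values).
        for n in range(max_rounds):
            misses = [(msg, truth) for msg, truth in messages
                      if classifier.classify(msg) != truth]
            if not misses:
                return n                  # cost has reached zero
            for msg, truth in misses:
                classifier.train(msg, truth)
        return max_rounds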

While adaptive estimators vary widely in how they decide what the next
iteration is, all of the ones that I've seen have something in common:  they
don't approach convergence monotonically.  For the same reason that they
have this behavior, they are also not guaranteed to converge to the global
minimum cost.  Think of the cost function as a surface (it is
multi-dimensional, but I can't visualize anything beyond three) with
numerous dips; one dip is deeper than the rest, and you will probably never
find that one.  This means (a toy sketch follows the list):

1) there will be bumps in the generally decreasing cost function as you
iterate

2) there is no proven way to distinguish a bump in the cost function from a
local or global minimum

3) there is no way to tell if a minimum in the cost function is local or
global (unless you can prove that there is a lower bound and show that
you've achieved it - good luck)

4) for most algorithms, there is no formal proof that they converge at all;
people just test to get a sense of robustness; algorithms that take smaller
steps per iteration tend to converge better, though slower, and also tend to
get stuck in local minima more easily; pick your poison
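
Points 2) and 3) are easy to see even in one dimension.  Here's a toy
sketch (the surface, step size, and starting points are invented for
illustration):  a fixed-step descent on a bumpy curve lands in whichever
dip it starts nearest, and nothing visible from inside the descent tells
you whether that dip is the deep one.

    import math

    def cost(x):
        # A bumpy 1-D cost surface:  several dips, one deeper than the
        # rest (invented for illustration).
        return math.sin(3 * x) + 0.1 * (x - 4.5) ** 2

    def descend(x, step=0.01, iters=2000):
        for _ in range(iters):
            # Estimate the gradient numerically and step downhill.
            grad = (cost(x + 1e-6) - cost(x - 1e-6)) / 2e-6
            x -= step * grad
        return x

    x1 = descend(1.0)       # lands in a shallow local dip
    x2 = descend(4.0)       # lands in the deepest dip
    print(x1, cost(x1))
    print(x2, cost(x2))     # lower cost, but only a luckier start found it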

That being said, everyone who uses these methods comes up with a criterion,
sometimes one that gives the appearance of mathematical validity and
sometimes a wild-ass heuristic, to tell them when to stop iterating.  I
wouldn't trust the result of a single iteration to tell me that I've found
one of the many minima in the cost function.  Since you are iteratively
estimating both the number of parameters you need and their values, you
are hopping around the surface in steps that may be wider than the minima,
so you should expect significant bumps in the convergence curve.

If you have a threshold cost value that you can call good enough, stopping
is as easy as it gets.  If you can't achieve that required cost threshold,
you may want to go a number of iterations beyond the point where you think
you've found a minimum, to make sure that you're not looking at a "speed
bump" in the middle of a long descent.  How hard you try to find out
whether you're stuck in a local minimum could be based on how good your
estimator currently is, but that's up to you.
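
One such wild-ass heuristic is to keep going some fixed number of rounds
past the best cost seen so far before declaring a minimum.  A minimal
sketch, assuming a hypothetical train_round() that does one full training
pass and returns the current cost (e.g. total misses):

    def train_until_stuck(train_round, patience=5, max_rounds=100):
        best = float("inf")
        rounds_since_best = 0
        for _ in range(max_rounds):
            c = train_round()
            if c == 0:
                return c                  # hit the good-enough threshold
            if c < best:
                best = c                  # new best cost; reset the clock
                rounds_since_best = 0
            else:
                rounds_since_best += 1
                if rounds_since_best >= patience:
                    break                 # looks like a minimum, not a
                                          # speed bump
        return best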

> [Kenny Pitt]
> I guess you could correct for that by stopping if the total misses
> increases or if both ham misses and spam misses stay the same, but that
> doesn't feel quite right either.  If nothing else, it fails to account
> for Tony's original question:  "if one message was still a
> false-positive, but moved from 0.8 to 0.7, is that improving?"

There's no definitive answer.  The choice of cost function is completely up
to you, and it has a lot to do with the quality of the resulting estimator.
Though continuous cost functions, like total distance (or mean-squared
distance) from perfect classification, are intellectually satisfying (I
personally prefer these), it is possible that a discrete cost function, like
the number of misclassifications, will perform better.  The only answer is
in testing.
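
For concreteness, here are the two flavors side by side (the 0.5 cutoff is
invented for illustration; it is not SpamBayes's actual threshold).  On
Tony's example, the continuous cost calls the 0.8-to-0.7 move an
improvement and the discrete cost doesn't:

    def discrete_cost(results, cutoff=0.5):
        # Number of misclassifications; 'results' is a list of
        # (score, is_spam) pairs.
        return sum(1 for score, is_spam in results
                   if (score >= cutoff) != is_spam)

    def continuous_cost(results):
        # Mean-squared distance from perfect classification
        # (ideal score:  0.0 for ham, 1.0 for spam).
        return sum((score - (1.0 if is_spam else 0.0)) ** 2
                   for score, is_spam in results) / len(results)

    before, after = [(0.8, False)], [(0.7, False)]
    print(discrete_cost(before), discrete_cost(after))      # 1 1
    print(continuous_cost(before), continuous_cost(after))  # ~0.64 ~0.49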

--

Seth Goodman



