[spambayes-dev] train to exhaustion?

Thu Feb 12 02:27:38 EST 2004

[Skip]
> Did anyone see Gary Robinson's blog (and related pages) about
> train-to-exhaustion?  Justin Mason posted a reference on 
> the spambayes list.

Like Tim, I read it at the time, and later heard someone (Bill Yerazunis?)
mention it while I was watching the 2004 MIT Spam Conference webcasts.

[Skip]
> Does one of the incremental training regimens implement it under a
> different name?
> 
> Don't think so, although the fpfnunsure regime seems to
> correspond closely to one *pass* of TTE.  TTE is like running 
> fpfnunsure repeatedly, starting each pass with the trained 
> database from the end of the last pass (and starting with an 
> empty training database), until results stop improving.

By "results stop improving", do you think the intention is that the number
of misclassified messages stops dropping, or that the scores stop getting
better?  (i.e. if one message was still a false positive, but moved from 0.8
to 0.7, does that count as improving?)
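
To make the two readings concrete, here's a rough sketch of the loop in
Python.  Everything in it (ToyClassifier, evaluate, train_to_exhaustion, the
0.2/0.9 cutoffs, max_passes) is made up for illustration; it is not the
spambayes API and not the incremental.py regime.  The by_score flag switches
between counting messages that are wrong or unsure and summing how far each
score sits from its ideal value.

```python
from collections import Counter


class ToyClassifier:
    """Hypothetical stand-in for a real classifier (not the spambayes API)."""

    def __init__(self):
        self.spam = Counter()
        self.ham = Counter()

    def train(self, tokens, is_spam):
        # Count each distinct token once per trained message.
        (self.spam if is_spam else self.ham).update(set(tokens))

    def score(self, tokens):
        # Crude spam score: mean per-token spam ratio over known tokens.
        ratios = []
        for tok in set(tokens):
            s, h = self.spam[tok], self.ham[tok]
            if s + h:
                ratios.append(s / (s + h))
        return sum(ratios) / len(ratios) if ratios else 0.5


def evaluate(clf, corpus, ham_cutoff=0.2, spam_cutoff=0.9):
    """Return (messages that are fp, fn or unsure, summed score error)."""
    wrong, error = [], 0.0
    for tokens, is_spam in corpus:
        p = clf.score(tokens)
        error += (1.0 - p) if is_spam else p
        ok = p >= spam_cutoff if is_spam else p <= ham_cutoff
        if not ok:
            wrong.append((tokens, is_spam))
    return wrong, error


def train_to_exhaustion(corpus, by_score=False, max_passes=50):
    """Repeat fpfnunsure-style passes until the chosen measure stops dropping."""
    clf = ToyClassifier()  # start with an empty training database
    best = None
    passes = 0
    for passes in range(1, max_passes + 1):
        # One pass: train only on messages that are currently fp, fn or unsure.
        for tokens, is_spam in corpus:
            wrong, _ = evaluate(clf, [(tokens, is_spam)])
            if wrong:
                clf.train(tokens, is_spam)
        wrong, error = evaluate(clf, corpus)
        # Reading 1: count of non-correct messages.  Reading 2: summed error.
        measure = error if by_score else len(wrong)
        if best is not None and measure >= best:
            break  # no improvement over the best earlier pass
        best = measure
    return clf, passes, best
```

With by_score=False a false positive that drifts from 0.8 to 0.7 doesn't
count as progress, so the loop stops; with by_score=True it does count, and
the loop carries on for another pass.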

I've written up a regime to do this with the incremental.py setup, or at
least I hope so :)  It's damn slow, though.  I can't get it to run at a
usable speed unless I retest against only a very recent portion (about 2
days) of mail.

With my data and this setup (allowing mail to be trained more than once,
and using the latest 2 days of mail), I found it gave better results than
fpfnunsure, but still not as good as nonedge (apart from very early on, when
all sorts of weird things happen with all the regimes, which I think is an
artefact of that mail).

<http://www.massey.ac.nz/~tameyer/research/spambayes/exhaustion.html> has
graphs, a bit more in the way of write-up, and also some output from Skip's
tte.py script.

=Tony Meyer