[spambayes-dev] Near-twin ham/spam, Train-to-exhaustion, feature ideas

David Abrahams dave at boost-consulting.com
Tue Jun 12 18:57:13 CEST 2007

When I run the tte script, it normally stops after 4 or 5 rounds.  If it
runs for 6 or more rounds it's a sure sign that I've misclassified
something (**).  What I do then is run the script with -v and let it
show me the messages it's training on.  In the later rounds it's always
training on one or two messages, and those are the culprits.  I just look for
those message IDs in TBird and move them into the right training sets.

I do this training automatically on my server, so what I'd like to do
is have the script automatically email me a notice identifying problem
messages in my training set.  Maybe it should even mark them deleted
(I use IMAP) and restart the process.  Thoughts?
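
A minimal sketch of the notice step, assuming the message IDs trained
in each round have already been collected into a list of lists (for
example by parsing the -v output, whose exact format is not assumed
here).  The names find_suspects and build_notice, the 5-round cutoff
parameter and the addresses are hypothetical and not part of the tte
script.

```python
from collections import Counter
from email.message import EmailMessage


def find_suspects(rounds, normal_rounds=5):
    """Return message IDs that were still being trained after the
    first `normal_rounds` rounds, most frequently trained first.

    `rounds` is a list with one entry per round; each entry is the
    list of message IDs trained in that round (hypothetical input).
    """
    late_rounds = rounds[normal_rounds:]
    counts = Counter()
    for trained in late_rounds:
        # Count each ID at most once per round, keeping input order.
        for msg_id in dict.fromkeys(trained):
            counts[msg_id] += 1
    return [msg_id for msg_id, _ in counts.most_common()]


def build_notice(suspects, sender, recipient):
    """Build (but do not send) an email listing the suspect IDs."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = "tte: %d suspect message(s) in training set" % len(suspects)
    body = "These messages were still being trained in late rounds:\n\n"
    body += "\n".join(suspects)
    msg.set_content(body)
    return msg


# Sending is left to the caller, e.g. with the standard library:
#   import smtplib
#   with smtplib.SMTP("localhost") as smtp:
#       smtp.send_message(notice)
```

The suspects are only candidates: as the footnote below says, a message
that keeps getting retrained may be correctly classified but have an
oppositely-classified near-twin, so deleting them automatically would
still need a human check.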

(**) Technically speaking, running for 6 or more rounds doesn't
necessarily identify a misclassification.  Sometimes it identifies a
correctly-classified message for which there is an
oppositely-classified near-twin.  Today, it had some trouble with a
"correctly" classified-as-ham Mailman moderation request message
containing a piece of spam that I had also received directly and thus
classified as spam.  So everything was classified "correctly."  What
prompted me to look at this situation was that lots of ham started
falling into my "unsure" folder today.  So despite the fact that
everything was classified correctly, overall performance was
noticeably reduced.  I'm tempted to conclude that TTE running for more
than 5 rounds just indicates a classification that's bad for

I guess my next question is whether this near-twin classification (the
difference being that one of the messages was a moderation request) is
supposed to work well?  I guess if I have some other moderation
requests in my spam folder, that could really confuse things... Hmm, I
straightened that out and it still didn't finish training speedily.
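
One way to look for such near-twins before training is sketched below.
It is an illustration only: it uses plain whitespace tokens rather
than the SpamBayes tokenizer, and the function names and the 0.8
threshold are made up.  It measures how much of each spam message is
contained in each ham message, which fits the case of a moderation
request that wraps a spam.

```python
def tokens(text):
    """Crude tokenizer (assumption): lowercase whitespace-split words."""
    return set(text.lower().split())


def containment(inner, outer):
    """Fraction of the tokens in `inner` that also appear in `outer`."""
    if not inner:
        return 0.0
    return len(inner & outer) / len(inner)


def near_twins(ham, spam, threshold=0.8):
    """Return (ham_id, spam_id, score) for ham messages that contain
    most of the tokens of some spam message.

    `ham` and `spam` map message IDs to body text (hypothetical input).
    """
    ham_tokens = {hid: tokens(body) for hid, body in ham.items()}
    pairs = []
    for sid, sbody in spam.items():
        stoks = tokens(sbody)
        for hid, htoks in ham_tokens.items():
            score = containment(stoks, htoks)
            if score >= threshold:
                pairs.append((hid, sid, score))
    return pairs
```

This compares every spam against every ham, so it is only practical
for modest training sets.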

