[spambayes-dev] My experience with SpamBayes.

Thu May 13 19:46:30 EDT 2004

> I used to use SpamBayes' Outlook plugin naively, and saw poor results
> (frequent false negatives and 25% of messages went into "Junk 
> Suspects" with an imbalanced DB of >10000 messages).  Then I did the 
> following, and I'm seeing near-perfect classification with 55 ham
> and 83 spam in my DB.

[description of non-edge training snipped]

> IMHO, it's a poor piece of software that requires the user to manually
> balance a database and/or develop the expertise to manually 
> train as I did.

It's not 100% clear whether nonedge training (what you did) or mistake-based
training works better (testing has been limited - most of it by Alex and
then me).  The idea of mistake-based training has been around longer, so
that is what the plug-in is optimised for.

<http://cashew.wolfskeep.com/~popiel/spambayes>
<http://www.massey.ac.nz/~tameyer/research/spambayes/incremental.html>

What you originally did is highly unlikely to have been mistake-based
training if you had over 10,000 messages trained (with mistake-based
training the database would probably have been roughly balanced, too).  If
you had trained only on mistakes, you would probably have seen results about
as good as you are getting now.

FWIW, to do mistake-based training with the plug-in:

  * Don't train anything.  Everything ends up unsure.
  * Train all mistakes - this means everything unsure, all false positives,
    and all false negatives.

Simple, right?  And to do the training, you only have to use either the
"Delete As"/"Recover From" buttons, or the drag-and-drop method (with the
incremental training options on).  It only takes a few messages for
classification to be good, so hardly any time is involved, and, like
nonedge, you end up with a small database.
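
As a rough illustration of the regime described above (not the plug-in's
actual code), here is a minimal Python sketch.  Everything in it is an
assumption made for the example: ToyClassifier is a hypothetical stand-in
scorer that does not use SpamBayes' real scoring, the cutoffs are just
illustrative thresholds, and mistake_based_training is a made-up helper
name.  The only point is the rule: start empty, and train a message only if
it came out unsure or wrong.

```python
from collections import Counter

# Illustrative thresholds (assumed for this sketch).
HAM_CUTOFF = 0.20
SPAM_CUTOFF = 0.90


class ToyClassifier:
    """Hypothetical stand-in scorer; NOT the SpamBayes scoring algorithm."""

    def __init__(self):
        self.spam_tokens = Counter()
        self.ham_tokens = Counter()
        self.nspam = 0
        self.nham = 0

    def train(self, tokens, is_spam):
        # Count each distinct token once per trained message.
        counts = self.spam_tokens if is_spam else self.ham_tokens
        counts.update(set(tokens))
        if is_spam:
            self.nspam += 1
        else:
            self.nham += 1

    def score(self, tokens):
        # Average the per-token spam ratio over tokens seen in training.
        # With no evidence at all, return 0.5 (i.e. unsure).
        probs = []
        for tok in set(tokens):
            s = self.spam_tokens[tok] / self.nspam if self.nspam else 0.0
            h = self.ham_tokens[tok] / self.nham if self.nham else 0.0
            if s + h > 0:
                probs.append(s / (s + h))
        return sum(probs) / len(probs) if probs else 0.5


def classify(score):
    if score >= SPAM_CUTOFF:
        return "spam"
    if score <= HAM_CUTOFF:
        return "ham"
    return "unsure"


def mistake_based_training(clf, stream):
    """Train only on unsures, false positives and false negatives.

    stream yields (tokens, is_spam) pairs, where is_spam is the user's
    verdict.  Returns how many messages were trained.
    """
    trained = 0
    for tokens, is_spam in stream:
        verdict = classify(clf.score(tokens))
        correct = "spam" if is_spam else "ham"
        if verdict != correct:  # unsure or misclassified
            clf.train(tokens, is_spam)
            trained += 1
    return trained
```

Fed a stream of messages in arrival order, the helper trains the first few
(everything starts out unsure) and then skips whatever is already classified
correctly, which is why the database stays small.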

Until it's clearer which training regime is actually best, leaving the
plug-in set up for mistake-based training seems wise, given that people are
familiar with it and it's very simple.  That said, there are plans to make
changes to make it easier to try out other training regimes - automatically
refiltering the unsure folder, for example.  However, the focus at the
moment is getting 1.0 out the door - and this means not making any major
changes, and focusing on stability.  Once work on 1.1 starts, these sorts of
things will get added in.  (Maybe some sort of train-to-exhaustion scheme,
too, which enforces balance).

Don't forget that you're using bleeding-edge software, here, along with
reasonably bleeding-edge ideas.

> These problems will not go away as long as developers 
> continue to compare results using only balanced ham+spam sets, ignoring
> the plight of the naïve user.

I hope you are not referring to the SpamBayes developers here.  If you are,
then you should consider looking at the tests that have been done.  Start
with the link to Alex's tests above, and you'll immediately find tests
looking at the effects that an imbalance has.  Look through the
spambayes-dev/spambayes archives for cross-validation testing, and consider
how many actually use balanced corpora (hint: not all).

If there's something in the documentation (etc) that you think encourages
people to train-on-everything (which is probably what you were initially
doing), then point it out and we'll address it.  Once upon a time,
train-on-everything seemed like the best thing to do, so there could easily
be legacy text that encourages it.  This is volunteer-based open-source - to
get better, people need to contribute!

=Tony Meyer