[Spambayes] re-org - making a package &c.

Mark Hammond mhammond at skippinet.com.au
Tue Jan 14 12:20:05 EST 2003

> Willing to do more than just give feedback, at least I would:)
> Suppose "spambayes --slash-training-styles" would run against several
> databases, each of those databases keeping track of the probs for a
> particular training style, adding extra headers indicating if and how
> the different databases scored this particular email.  Me, I would be
> willing then to be careful to train according to all training style
> candidates simultaneously.
> The advantage would be that we would be comparing all training methods
> on the same data.

My idea was closer to the existing test harness we have.  I was thinking of
somehow formalizing Tim's original hapax experiments.

From my limited playing with our test harness, it seems that we simply pick
random messages from our ham and spam folders, train over these messages,
then score these messages against the trained data.  This hasn't been as
important for a few months, as the algorithm hasn't changed in that period.

What if we changed this to perform a "time ordered" selection of messages?
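
As a rough sketch of what "time ordered" selection could look like: the
helper name `time_ordered` is made up for illustration, and it assumes each
message is a raw RFC 2822 string whose Date header is usable as the ordering
key.  Messages without a parseable date are simply skipped here, which a
real harness might not want.

```python
from datetime import timezone
from email import message_from_string
from email.utils import parsedate_to_datetime


def time_ordered(raw_messages):
    """Return raw messages sorted by their Date header, oldest first.

    Hypothetical helper: messages with a missing or unparseable Date
    header are dropped rather than guessed at.
    """
    dated = []
    for raw in raw_messages:
        msg = message_from_string(raw)
        try:
            when = parsedate_to_datetime(msg["Date"])
        except (TypeError, ValueError):
            continue  # no usable date, so no place in the ordering
        if when.tzinfo is None:
            # Treat zone-less dates as UTC so all values are comparable.
            when = when.replace(tzinfo=timezone.utc)
        dated.append((when, raw))
    dated.sort(key=lambda pair: pair[0])
    return [raw for _, raw in dated]
```

Received headers or folder order might be a better key than Date for some
mail stores; the sort key is the only part that would need to change.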

For example, off the top of my head, I can see 2 training candidates (there
would be a number more, but let's start with just 2):

* Do not start filtering until we have, say, 20 spam and 20 ham.  Once we
reach this threshold, we go into a little "initial training mode".  This
mode trains on the ham and spam, then scores the entire inbox.  We continue
until the user indicates there are no spams left in their inbox.

* Start filtering immediately, but only incrementally train on either
incorrect or unsure classifications.

Our test harness would be designed to test multiple strategies over our
standard corpora.  Instead of random messages, time-ordered messages would be
iterated over.  Results similar to the existing ones would be produced, so we
can compare results over vastly different mail stores.  IMO, it is far more
important to know the best training strategy across vastly different mail
stores than to know which strategy works best on any individual's store.
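
To make the shape of such a harness concrete, here is a minimal sketch.
Everything in it is an assumption for illustration: `ToyClassifier` is a
stand-in and not the Spambayes classifier, the class and function names are
invented, the cutoffs of 0.2 and 0.9 are just example values, and the first
strategy is simplified to a single batch train once the threshold is reached
(it does not model the user confirming that no spam is left).

```python
from collections import Counter


class ToyClassifier:
    """Stand-in classifier (hypothetical, not the Spambayes API)."""

    def __init__(self):
        self.counts = {True: Counter(), False: Counter()}

    def train(self, text, is_spam):
        self.counts[is_spam].update(set(text.lower().split()))

    def score(self, text):
        """Return a spam score in [0, 1]; 0.5 means no evidence."""
        spam = ham = 0
        for tok in set(text.lower().split()):
            spam += self.counts[True][tok]
            ham += self.counts[False][tok]
        return 0.5 if spam + ham == 0 else spam / (spam + ham)


class BatchThenFilter:
    """Hold off filtering until `need` ham and `need` spam have been seen,
    train on that batch once, then filter without further training."""

    def __init__(self, need=20):
        self.need = need
        self.pending = []
        self.filtering = False

    def observe(self, clf, text, is_spam, verdict):
        if self.filtering:
            return
        self.pending.append((text, is_spam))
        n_spam = sum(1 for _, spam in self.pending if spam)
        if min(n_spam, len(self.pending) - n_spam) >= self.need:
            for pending_text, pending_spam in self.pending:
                clf.train(pending_text, pending_spam)
            self.pending.clear()
            self.filtering = True


class TrainOnMistakes:
    """Filter from the first message; train only on wrong or unsure results."""

    filtering = True

    def observe(self, clf, text, is_spam, verdict):
        if verdict != ("spam" if is_spam else "ham"):
            clf.train(text, is_spam)


def run(messages, strategy, ham_cutoff=0.2, spam_cutoff=0.9):
    """Replay (text, is_spam) pairs, assumed to be in time order already,
    and tally the outcomes for one strategy."""
    clf = ToyClassifier()
    tally = Counter()
    for text, is_spam in messages:
        if not strategy.filtering:
            verdict = "unfiltered"
        else:
            score = clf.score(text)
            if score >= spam_cutoff:
                verdict = "spam"
            elif score <= ham_cutoff:
                verdict = "ham"
            else:
                verdict = "unsure"
        if verdict in ("unfiltered", "unsure"):
            tally[verdict] += 1
        elif (verdict == "spam") == is_spam:
            tally["correct"] += 1
        elif is_spam:
            tally["false_negative"] += 1
        else:
            tally["false_positive"] += 1
        strategy.observe(clf, text, is_spam, verdict)
    return tally
```

Running `run(corpus, strategy)` for every strategy over every mail store
would give one tally per pair, which is the across-stores comparison
described above.  Further candidates would only need an `observe` method and
a `filtering` attribute.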

I am pretty sure this is similar to your idea, but I thought it worth
pointing out that we possibly already have some test framework we can
leverage here.

