[Spambayes] training problem?

Seth Goodman nobody at spamcop.net
Wed Dec 3 19:42:09 EST 2003

[Ryan Malayter]
> I think another key is the training of all unsures as ham or spam,
> regardless of their score. You mentioned only training unsures that were
> less than 50% for some reason, I don't know why you would do that.

Because I got around 25 unsures per day.  I was attempting to limit the
growth of the database by only training on the spam that scored lowest.

[Ryan Malayter]
> Unsure means it falls somewhere in the middle, and intuitively I think
> training on it (in either direction) will improve the probabilities that
> those tokens will push future messages towards either end, making the
> tokens "less unsure", which is what you want when you train.

This is why I proposed on the Wiki the continuous train-on-everything
approach with an automatic mechanism to prune the database of the tokens
associated with the oldest trained messages.  Realizing that some spam is
"trickier" and doesn't occur as often, I also suggested that misclassified
messages have their tokens stay longer according to the amount of

[Ryan Malayter]
> If you really get 5 times as much spam as you do ham, then I think you
> should take a month's worth of ham, and a month's worth of spam. Find
> some way to randomly sub-sample the month's worth of spam down to a
> number similar to the number of ham. (Sorting by the Spam score
> previously assigned the messages and choosing the lowest 1/5 might be
> an interesting way to do this, and would have you training on the
> "sneakiest" of your spam).
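
A rough sketch of that sub-sampling idea, in Python since that is what
Spambayes is written in.  The function name, the (score, message id) pair
layout and the seed parameter are all made up for illustration; nothing
here is Spambayes API.

```python
import random


def subsample_spam(scored_spam, n_ham, by_score=True, seed=None):
    """Cut a pile of spam down to roughly the size of the ham pile.

    scored_spam: list of (score, message_id) pairs (hypothetical layout),
                 where score is the spam score previously assigned.
    n_ham:       number of ham messages we want to balance against.
    by_score:    True  -> keep the lowest-scoring ("sneakiest") spam,
                 False -> keep a uniform random sample instead.
    """
    n = min(n_ham, len(scored_spam))
    if by_score:
        # Lowest scores first, so the slice keeps the hardest-to-catch spam.
        return sorted(scored_spam, key=lambda pair: pair[0])[:n]
    # A private Random instance keeps the sample repeatable for a given seed.
    return random.Random(seed).sample(scored_spam, n)


if __name__ == "__main__":
    spam = [(0.99, "a"), (0.55, "b"), (0.91, "c"), (0.62, "d"), (1.00, "e")]
    print(subsample_spam(spam, n_ham=2))
    print(subsample_spam(spam, n_ham=2, by_score=False, seed=1))
```

Sorting by score biases the training set toward borderline spam, while
the random option keeps it representative of what actually arrives.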

This is sort of what I was doing when, after initial training on all spam
within a time window, I incrementally trained only on spam with a score
lower than 50.  So far, it hasn't worked any better.  Skip also suggested a
similar approach; his suggestion was to do it incrementally, which would
result in the minimum number of spam to get the desired token set.  I was
hoping to come up with an algorithm for continuous, automatic training that
had the best properties of both of these methods without requiring
periodically starting over.
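
To make the train-on-everything-and-expire idea concrete, here is a
minimal sketch.  It is not Spambayes code: the class, its method names,
the integer day numbers and the two lifetime parameters are all
assumptions for illustration.  In particular, the extra lifetime given to
misclassified messages is a flat bonus here, which is just a placeholder
for whatever scaling turns out to work.

```python
from collections import Counter


class ExpiringTrainingSet:
    """Per-token message counts that forget old training data."""

    def __init__(self, max_age_days=120, mistake_bonus_days=240):
        self.max_age_days = max_age_days
        self.mistake_bonus_days = mistake_bonus_days  # assumed flat bonus
        self.messages = {}  # msg_id -> (is_spam, tokens, expiry_day)
        self.spam_counts = Counter()
        self.ham_counts = Counter()

    def _counts(self, is_spam):
        return self.spam_counts if is_spam else self.ham_counts

    def _forget(self, msg_id):
        """Remove one trained message and its contribution to the counts."""
        is_spam, tokens, _ = self.messages.pop(msg_id)
        counts = self._counts(is_spam)
        counts.subtract(tokens)
        for token in tokens:
            if counts[token] <= 0:
                del counts[token]  # keep the database from growing

    def train(self, msg_id, tokens, is_spam, today, misclassified=False):
        """Train on a message; misclassified ones are kept longer."""
        if msg_id in self.messages:
            self._forget(msg_id)  # retraining replaces the old entry
        lifetime = self.max_age_days
        if misclassified:
            lifetime += self.mistake_bonus_days
        tokens = frozenset(tokens)  # count each token once per message
        self.messages[msg_id] = (is_spam, tokens, today + lifetime)
        self._counts(is_spam).update(tokens)

    def prune(self, today):
        """Forget every message whose lifetime has run out."""
        expired = [m for m, (_, _, expiry) in self.messages.items()
                   if expiry <= today]
        for msg_id in expired:
            self._forget(msg_id)
        return expired
```

Calling train() on every incoming message and prune() once a day would
keep the database bounded without ever starting over from scratch.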

Seth Goodman

