[Spambayes] training problem?
rmalayter at bai.org
Wed Dec 3 17:34:20 EST 2003
> I'm amazed at your low unsure rate.
I get about 1-2 messages a day in my unsure folder, out of hundreds of
messages. I think the 80/20 settings have a lot to do with that: the
range that unsures can occupy is narrower. I've never seen a ham score
over 80, and I've only seen a few spams that score below 20, so I feel
confident with
I think another key is training all unsures as ham or spam, regardless
of their score. You mentioned only training unsures that were less than
50% for some reason; I don't know why you would do that. Unsure means
the message falls somewhere in the middle, and intuitively I think
training on it (in either direction) will improve the probabilities
that those tokens will push future messages towards either end, making
the tokens "less unsure", which is what you want when you train.
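
To make the cutoff discussion concrete, here is a minimal sketch of how
a pair of cutoffs splits scores into three folders, and why 80/20 leaves
a narrower unsure band than 90/10. The names classify and unsure_width
are hypothetical, scores are assumed to be on a 0-100 scale, and the
handling of scores exactly on a cutoff is my assumption, not necessarily
what Spambayes itself does.

```python
def classify(score, ham_cutoff=20.0, spam_cutoff=80.0):
    """Map a 0-100 spam score to 'ham', 'unsure', or 'spam'.

    Hypothetical helper: the boundary handling (< ham_cutoff is ham,
    >= spam_cutoff is spam) is an assumption made for this sketch.
    """
    if not 0.0 <= score <= 100.0:
        raise ValueError("score must be between 0 and 100")
    if score < ham_cutoff:
        return "ham"
    if score >= spam_cutoff:
        return "spam"
    return "unsure"


def unsure_width(ham_cutoff, spam_cutoff):
    """Width of the score band that lands in the unsure folder."""
    return spam_cutoff - ham_cutoff


if __name__ == "__main__":
    # The same message can be spam under 80/20 but unsure under 90/10.
    print(classify(85.0, 20.0, 80.0))
    print(classify(85.0, 10.0, 90.0))
    print(unsure_width(20.0, 80.0), unsure_width(10.0, 90.0))
```

With 80/20 the unsure band is 60 points wide; with 90/10 it is 80 points
wide, so more messages end up needing a manual decision.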
> When I originally trained on 650 spam and 650 ham, that amounted to
> about five days of spam and 26 days of ham. Now I'm wondering if the
> longer time frame for spam is the key. Does anyone have any thoughts
> on this?
If you really get 5 times as much spam as you do ham, then I think you
should take a month's worth of ham, and a month's worth of spam. Find
some way to randomly sub-sample the month's worth of spam down to a
number similar to the number of ham. (Sorting by the spam score
previously assigned to the messages and choosing the lowest 1/5 might
be an interesting way to do this, and would have you training on the
"sneakiest" of your spam).
> One small note: on the email list, you mentioned using thresholds
> of 80/20, but on the Wiki you said 90/10.
I actually use 80/20, but I didn't want to put that in the Wiki, for
fear of someone getting a false positive and calling me a jerk. At the
time I was Wikiing, I thought, the more conservative the better. But
maybe I should amend it to say what I actually use, and just warn the