[Spambayes] Is Equal Ham & Spam really the best?
tim.peters at gmail.com
Sun Jul 29 05:21:18 CEST 2007
> I've just started using spambayes again after a while away from it.
> Now, 3 days in, I notice that I've trained on far more spam than ham.
> (Total emails trained: Spam: 432, Ham: 64.) I seem to remember that
> this was my experience in the past as well.
> My question is: has anybody really tested the assertion that leads to
> the message "Warning: you have much more spam than ham - SpamBayes
> works best with approximately even numbers of ham and spam."?
Yes, but by the time you and Tony wrote your paper, serious
multi-corpus testing had long since essentially stopped. The results
with large imbalances were so dramatically worse that I introduced the
infamous "experimental ham spam imbalance adjustment" switch, which
tried to stop "the math" from drawing absurdly confident conclusions
from wildly unbalanced data (see the thread Mark pointed out). The
results of that were a mixed bag, helping some people a little but
hurting others more, so we dropped it.
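To make the "absurdly confident conclusions" point concrete, here is a
minimal sketch of a Robinson-style per-token spam probability. It is an
illustration only, not the SpamBayes source: the function name and the
constants s=0.45 and x=0.5 are assumptions, and the old adjustment switch
itself is not reproduced. The training totals are the 432 spam / 64 ham from
the question.

```python
def token_spamprob(spam_count, ham_count, nspam, nham, s=0.45, x=0.5):
    """Estimate P(spam | token) from training counts (illustrative sketch).

    spam_count / ham_count: messages of each kind containing the token.
    nspam / nham: total trained messages of each kind (assumed > 0).
    s, x: assumed prior strength and prior probability for unseen tokens.
    """
    n = spam_count + ham_count
    if n == 0:
        return x  # never seen: fall back to the prior
    # Counts are normalized by corpus size, so each ham "weighs" more
    # when there are few ham in the training data.
    spam_ratio = spam_count / nspam
    ham_ratio = ham_count / nham
    prob = spam_ratio / (spam_ratio + ham_ratio)
    # Shrink toward the prior x when the token has little evidence.
    return (s * x + n * prob) / (s + n)


# A token seen in 5 spam and 1 ham:
balanced = token_spamprob(5, 1, nspam=100, nham=100)   # leans spammy (> 0.5)
imbalanced = token_spamprob(5, 1, nspam=432, nham=64)  # leans hammy (< 0.5)
```

With 432:64 training data, a single ham sighting outweighs five spam
sightings, which is the kind of conclusion the switch was trying to rein in.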
As I'm sure one of the text files in the project says, /all/ decisions
"should be" reevaluated periodically. Alas, a one-corpus test is
essentially useless, and it was hard even some years ago to arrange
for multi-corpus tests.
When the original testing was done, almost all spam was text-heavy,
meaning lots of tokens were generated. The paucity of tokens
generated for more recent image-based spam, and spam hiding in
attachments, makes SB's basic /approach/ less useful for that kind of
spam. I have no real idea how imbalance affects scoring spam of that kind.
The only thing I've done in response is lower my "spam
threshold", down to 70 now, with the ham threshold at 5. My unsure rate is
about 6%, and most of the unsures are spam. Every now and again I add the 10 most
recent ham to my ham training data, but even so I've got about a 3:1
spam:ham training ratio. I do expect my stats would improve if I
added more ham (I'm one of the ones the old imbalance option helped),
but I spend so little time looking at unsures it's just not worth even
tiny efforts to improve it.
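For anyone unsure how the two thresholds interact, here is a minimal sketch
of three-way classification on a 0-100 score scale, using the 5 and 70 cutoffs
mentioned above. The function names and the handling of scores exactly on a
cutoff are assumptions, not taken from the SpamBayes code.

```python
def classify(score, ham_cutoff=5.0, spam_cutoff=70.0):
    """Map a 0-100 spam score to 'ham', 'unsure', or 'spam'.

    Boundary handling (strict < for ham, >= for spam) is an assumption.
    """
    if score < ham_cutoff:
        return "ham"
    if score >= spam_cutoff:
        return "spam"
    return "unsure"


def unsure_rate(scores, ham_cutoff=5.0, spam_cutoff=70.0):
    """Fraction of scores that land between the two cutoffs."""
    if not scores:
        return 0.0
    unsure = sum(
        1 for s in scores if classify(s, ham_cutoff, spam_cutoff) == "unsure"
    )
    return unsure / len(scores)


# Hypothetical scores, for illustration only.
sample_scores = [1.0, 3.0, 50.0, 85.0, 99.0, 100.0, 0.0, 72.0]
rate = unsure_rate(sample_scores)
```

Lowering the spam cutoff narrows the unsure band from the top, so
low-token spam that scores in the 70s or 80s is filed as spam instead of
unsure.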