[Spambayes] Is Equal Ham & Spam really the best?
spambayes at whateley.com
Sun Jul 29 08:06:54 CEST 2007
Tim Peters wrote:
> [Brendon Whateley]
>> I've just started using spambayes again after a while away from it.
>> Now, 3 days in, I notice that I've trained on far more spam than ham.
>> (Total emails trained: Spam: 432, Ham: 64.) I seem to remember that
>> this was also my experience in the past.
>> My question is: has anybody really tested the assertion that leads to
>> the message: "Warning: you have much more spam than ham - SpamBayes
>> works best with approximately even numbers of ham and spam."?
> Yes, but by the time you and Tony wrote your paper, serious
> multi-corpus testing had long since essentially stopped. The results
> with large imbalances were so dramatically worse that I introduced the
> infamous "experimental ham spam imbalance adjustment" switch, which
> tried to stop "the math" from drawing absurdly confident conclusions
> from wildly unbalanced data (see the thread Mark pointed out). The
> results of that were a mixed bag, helping some people a little but
> hurting others more, so we dropped it.
Yes I remember that. I can also guess why serious multi-corpus testing
stopped... as I recall, the pain of putting them together is not for the
faint of heart :)
> As I'm sure one of the text files in the project says, /all/ decisions
> "should be" reevaluated periodically. Alas, a one-corpus test is
> essentially useless, and it was hard even some years ago to arrange
> for multi-corpus tests.
In the worst case, I can satisfy my own curiosity and possibly provide
some insight. I may be able to gather several different corpora for
some testing. How many separate corpora would you consider a valid test?
> When the original testing was done, almost all spam was text-heavy,
> meaning lots of tokens were generated. The paucity of tokens
> generated for more recent image-based spam, and spam hiding in
> attachments, makes SB's basic /approach/ less useful for that kind of
> spam. No real idea how imbalance affects scoring spam of that kind.
That is the thinking that led to my question about the imbalance effect.
Perhaps some method of generating tokens from images would restore order
to our world.
> The only thing I've done in response to it is lower my "spam
> threshold", down to 70 now, with ham at 5. My unsure rate is about
> 6%, most of which are spam. Every now and again I add the 10 most
> recent ham to my ham training data, but even so I've got about a 3:1
> spam:ham training ratio. I do expect my stats would improve if I
> added more ham (I'm one of the ones the old imbalance option helped),
> but I spend so little time looking at unsures it's just not worth even
> tiny efforts to improve it.
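
To make the cutoffs above concrete, here is a minimal sketch of the three-way split they imply on a 0-100 score. The function name, the sample scores, and the handling of scores that land exactly on a cutoff are assumptions for illustration; this is not SpamBayes code.

```python
def classify(score, ham_cutoff=5.0, spam_cutoff=70.0):
    """Map a 0-100 spam score to 'ham', 'unsure', or 'spam'.

    The defaults are the cutoffs quoted above (ham at 5, spam at 70).
    Boundary handling is an assumption: a score equal to a cutoff
    falls on the higher side.
    """
    if score < ham_cutoff:
        return "ham"
    if score >= spam_cutoff:
        return "spam"
    return "unsure"


# Hypothetical scores, invented purely for illustration.
scores = [0.5, 4.9, 5.0, 42.0, 69.9, 70.0, 99.8]
labels = [classify(s) for s in scores]

# Fraction of messages that would need a manual look.
unsure_rate = labels.count("unsure") / len(labels)
```

Lowering the spam cutoff shrinks the unsure band from the top, so fewer spam need a manual look, at the risk of a borderline ham being filed as spam.
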
At the very least I can test your approach against what I've been doing,
which is to just let the imbalance grow until some ham gets pulled into
unsure. At that point I add the unsure ham and continue on. That answer
may be of some help to those who find their training leads to large
imbalances.
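
The two ham-training habits being compared could be sketched roughly as follows. Everything here is hypothetical scaffolding for a test harness: the function name, the policy labels, and the (score, is_ham) message representation are made up for illustration and are not part of SpamBayes.

```python
def ham_to_train(recent, policy, ham_cutoff=5.0, spam_cutoff=70.0, batch=10):
    """Pick which ham messages to add to the training data.

    recent: list of (score, is_ham) pairs, oldest first (hypothetical format).
    policy: "unsure-only"  -> train only on ham that scored as unsure
            "recent-batch" -> train on the `batch` most recent ham
    """
    ham = [(score, is_ham) for score, is_ham in recent if is_ham]
    if policy == "unsure-only":
        # Let the imbalance grow; add ham only once it lands in unsure.
        return [m for m in ham if ham_cutoff <= m[0] < spam_cutoff]
    if policy == "recent-batch":
        # Every now and again, add the most recent ham regardless of score.
        return ham[-batch:]
    raise ValueError("unknown policy: %r" % (policy,))


def spam_ham_ratio(n_spam, n_ham):
    """Training imbalance expressed as spam per ham message."""
    return n_spam / n_ham
```

With the counts from the original message, `spam_ham_ratio(432, 64)` gives 6.75, which is more than twice the roughly 3:1 ratio mentioned above.
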
When I get back, I'll start playing with this and see if anything useful