[Spambayes] Is Equal Ham & Spam really the best?
mhammond at skippinet.com.au
Sun Jul 29 03:26:52 CEST 2007
> Mark Hammond wrote:
> >> That is high relative to the conventional wisdom, but I'm
> >> the correctness of that wisdom.
> > Check out this thread, which should give you a reasonable idea:
> That thread was interesting, but still runs under the assumption that
> balanced training is the ideal.
I read that thread as *demonstrating* why unbalanced training will skew your
results. It makes no assumptions at all, but simply considers the facts
about how spambayes works and the math behind it. The assumptions you refer
to are a direct result of the facts presented there.
Do you disagree with the analysis of the math in that thread? If you don't
disagree, then I completely miss your point.
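
To make the arithmetic concrete, here is a minimal sketch of a Robinson-style per-token score, the general kind of calculation this discussion is about. It is not a copy of the spambayes source: the function name `token_spam_prob` is hypothetical, and the prior strength (0.45) and prior probability (0.5) are assumed values chosen for illustration. It only shows how normalising token counts by the number of trained ham and spam messages lets the size of each training set change a token's score.

```python
def token_spam_prob(spam_count, ham_count, nspam, nham,
                    prior_strength=0.45, prior_prob=0.5):
    """Illustrative Robinson-style token score (hypothetical helper).

    spam_count / ham_count: trained spam / ham messages containing the token.
    nspam / nham: total trained spam / ham messages.
    prior_strength, prior_prob: assumed values, for illustration only.
    """
    n = spam_count + ham_count
    if n == 0:
        # A token never seen in training falls back to the prior.
        return prior_prob
    # Counts are normalised by the size of each training set.
    spam_ratio = spam_count / nspam
    ham_ratio = ham_count / nham
    raw = spam_ratio / (spam_ratio + ham_ratio)
    # Shrink the raw estimate toward the prior when evidence is thin.
    return (prior_strength * prior_prob + n * raw) / (prior_strength + n)


# Same token evidence in both cases: seen in 5 spam and 1 ham.
balanced = token_spam_prob(5, 1, nspam=1000, nham=1000)
skewed = token_spam_prob(5, 1, nspam=1000, nham=100)  # 10:1 spam:ham

print(f"balanced training: {balanced:.3f}")
print(f"10:1 imbalance:    {skewed:.3f}")
```

With identical token counts, the score is above 0.5 (spammy) under balanced training and below 0.5 (hammy) under the 10:1 imbalance, purely because one ham occurrence out of 100 outweighs five spam occurrences out of 1000.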
>> Google also shows anecdotal reports of poor results after an imbalance as
>> low as 2:1, so I don't think it would be responsible to re-evaluate that
> "responsible"? I'm not sure what you mean.
responsible: worthy of or requiring trust.
In my opinion, it would be irresponsible to our users, who generally trust
the spambayes developers, for us to give out information that current wisdom
says is incorrect, especially when that wisdom is backed up by a solid
theoretical understanding of why it exists. It would be irresponsible for us
to change our current wisdom based on anecdotes of a single individual,
especially when opposite anecdotes can be easily found.
> I assumed that running a test to evaluate the effects of imbalance would
> be the way to generate or refute such evidence?
One person running a test is unlikely to cut it. If you design a test, you
may have luck getting others to run it against their email, in which case
the results will start to get interesting as the number of people increases.
> When I get back from Hawaii, I think I'll dust off the old test corpus and
> run some tests. If anybody else has some test results, I'd be very interested
> in seeing them.
Google is your friend here - you can find many discussions about the effects
of imbalances, and plenty of discussions about why a single test from a
single user isn't a useful indicator of anything. Searching for anything
Tim Peters has to say would be the most productive thing to do :)
> My current thought is that getting a (very) large amount of spam with
> very few clues in each email results in the imbalance. I've
> just checked some of today's spam and some had as few as 31 clues. With
> so few clues, it is relatively easy for a spam message to end up with an
> unsure or even ham classification while most ham is being correctly
> classified. The alternative to an imbalanced training set is to find an
> easy way to train on extra ham, but only the ham that still has some
> classification value to add.
I'm glad that spambayes appears to work well for you with a significant
imbalance, but I think we've already pointed out that there is solid
reasoning behind our position.