[Spambayes] Is Equal Ham & Spam really the best?

Sun Jul 29 16:19:59 CEST 2007

I begin by saying that I am neither a mathematician nor a programmer.  I
do have an understanding of statistical analysis.  I am a relatively new
user of spambayes and I am greatly pleased with the product and the results.

I have been following the discussion of training balance with great
interest.  I do not save or archive much ham and we receive 200-300 spam
emails daily.  Therefore, I did not have a 1:1 mix to train spambayes on.
My training ratio is 4957 ham to 18932 spam.  Spambayes misses a few
spams, especially as the spammers change their content and format.
Spambayes has yet to incorrectly identify a ham as spam.  I have
adjusted my filtering settings for questionable emails to 80.0 for
certain spam and 8.0 for uncertain.
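
As a rough illustration of what these two settings do, here is a minimal
sketch of three-way classification by score thresholds.  The function name
`classify`, the 0-100 score scale, and the exact handling of scores that
fall right on a cutoff are assumptions made for illustration; this is not
the SpamBayes code itself.  Only the two cutoff values come from the
message above.

```python
def classify(score, spam_cutoff=80.0, unsure_cutoff=8.0):
    """Map a spam score (assumed to be on a 0-100 scale) to a label.

    Hypothetical helper: scores at or above spam_cutoff are treated as
    certain spam, scores below unsure_cutoff as ham, and everything in
    between as unsure.
    """
    if score >= spam_cutoff:
        return "spam"
    if score >= unsure_cutoff:
        return "unsure"
    return "ham"


if __name__ == "__main__":
    for s in (95.0, 50.0, 3.0):
        print(s, classify(s))
```

Lowering the spam cutoff moves more mail out of the unsure range into
spam, at the cost of a higher risk of misfiling ham.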

I am extremely pleased with spambayes and I thank all who obviously have
put so much time into developing and tweaking this product.

Carl Swofford

-----Original Message-----
From: spambayes-bounces at python.org [mailto:spambayes-bounces at python.org]
On Behalf Of Brendon Whateley
Sent: Sunday, July 29, 2007 12:56 AM
To: Mark Hammond; spambayes at python.org; skip at pobox.com
Subject: Re: [Spambayes] Is Equal Ham & Spam really the best?

Mark Hammond wrote:
>> That thread was interesting, but still runs under the assumption that
>> balanced training is the ideal.
>>     
>
> I read that thread as *demonstrating* why unbalanced training will
> skew your results.  It makes no assumptions at all, but simply
> considers the facts
>   
We are suffering a semantic disagreement here.  That thread explains how
an imbalance influences token scores, but doesn't explain one way or
another if that harms classification in the wild.  Presumably at some
point it will cause harm, but just when is unclear.
> about how spambayes works and the math behind it.  The assumptions you
> refer to are a direct result of the facts presented there.
>
> Do you disagree with the analysis of the math in that thread?  If you
> don't disagree, then I completely miss your point.
>
>   
>>> Google also shows anecdotal reports of poor results after an
>>> imbalance as low as 2:1, so I don't think it would be responsible to
>>> re-evaluate that
>>>       
Anecdotal, as you also point out, is not the same as proof.
>> "responsible"?  I'm not sure what you mean.
>>     
>
> responsible: worthy of or requiring trust.
>
> In my opinion, it would be irresponsible to our users, who generally
> trust
>   
I was assuming that you didn't mean that, since in no way was I
suggesting changing anything at all.  At least not until the facts
behind the statement had been reevaluated and proven to be incorrect.
> the spambayes developers, for us to give out information that current
> wisdom says to be incorrect, especially when backed up by a solid
> theoretical understanding of why that wisdom exists.  It would be
> irresponsible for us to change our current wisdom based on anecdotes of
> a single individual, especially when opposite anecdotes can be easily
> found.
>   
I really don't know what this was in response to.  I ask a question and
you are replying as if I am demanding change.  That has never been the
spambayes way.
>   
>> I assumed that running a test to evaluate the effects of imbalance
>> would be the way to generate or refute such evidence?
>>     
>
> One person running a test is unlikely to cut it.  If you design a
> test, you may have luck getting others to run it against their email,
> in which case the results will start to get interesting as the number
> of people increases.
>   
Again, I never claimed that one person would "cut anything".  However,
if one person doesn't start carrying the effort forward, it will never get
reevaluated.  If my efforts at building a test lead to a result
confirming the status quo, then nothing more need be said.  At least
until some other wise guy comes along and challenges it :)  Since the
nature of Spam changes over time, the results of these challenges may
also change?
>   
>> When I get back from Hawaii, I think I'll dust off the old test corpus
>> and try some tests. If anybody else has some test results, I'd be very
>> interested in seeing them.
>>     
>
> Google is your friend here - you can find many discussions about the
> effects of imbalances, and plenty of discussions about why a single
> test from a single user isn't a useful indicator of anything.
> Searching for anything Tim Peters has to say would be the most
> productive thing to do :)
>   
That is a good idea, since gathering info from some of the core folks
would seem to save time in developing an effective test.
>> My current thought is that getting a (very) large amount of spam with
>> very few clues in each email results in the imbalance.  I've just
>> checked some of today's spam and some had as few as 31 clues.  With
>> so few clues, it is relatively easy for a spam message to end up with
>> an unsure or even ham classification while most ham is being
>> correctly classified.  The alternative to an imbalanced training set
>> is to find an easy way to train on extra ham, but only the ham that
>> still has some classification value to add.
>>     
>
> I'm glad that spambayes appears to work well for you with a
> significant imbalance, but I think we've already pointed out that
> there is solid reasoning behind our position.
>   
As a long-time spambayes user, promoter and sometimes contributor, I've
yet to find a user for whom spambayes does not work very well!  I
understand the theory but would like to understand the real world
implications, especially given the anecdotal evidence that imbalance
doesn't always harm performance.  Even Tim Peters has a 3:1 ratio :)
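
For readers who have not seen the theory being referred to, here is a
minimal sketch of how training imbalance can shift a single token's score.
It assumes a Robinson-style estimate: per-class counts are normalized by the
number of messages trained in each class, then pulled toward a neutral prior.
The function name `token_spam_prob` and the defaults s=0.45 and x=0.5 are
assumptions for illustration, not values taken from this thread, and the
code is not the SpamBayes implementation.

```python
def token_spam_prob(ham_count, spam_count, n_ham, n_spam, s=0.45, x=0.5):
    """Estimate how spammy a token looks from training counts.

    ham_count / spam_count: messages of each class containing the token.
    n_ham / n_spam: total messages trained in each class.
    s, x: assumed prior strength and prior probability for rare tokens.
    """
    n = ham_count + spam_count
    if n == 0:
        return x  # never-seen token: fall back to the prior
    ham_ratio = ham_count / n_ham
    spam_ratio = spam_count / n_spam
    prob = spam_ratio / (ham_ratio + spam_ratio)
    # Shrink toward the prior; the fewer sightings, the stronger the pull.
    return (s * x + n * prob) / (s + n)


if __name__ == "__main__":
    # A token seen in exactly one ham and one spam message.
    balanced = token_spam_prob(1, 1, n_ham=1000, n_spam=1000)
    # Same sightings, but with the 4957:18932 ratio quoted earlier.
    skewed = token_spam_prob(1, 1, n_ham=4957, n_spam=18932)
    print(balanced, skewed)
```

Under these assumptions the token is neutral with balanced training but
leans toward ham (roughly 0.26) with the skewed counts.  This shows only
that the score moves; whether that harms classification on real mail is the
open question in this thread.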

Anyway, I'm not trying to rock the boat, start a fight or anything else.
I've now got to go and pack,
Brendon.
