[Spambayes] Experimental Ham/Spam imbalance setting

Moore, Paul Paul.Moore at atosorigin.com
Fri May 23 11:00:30 EDT 2003


From: Tim Peters [mailto:tim.one at comcast.net]
> [Moore, Paul]
> > I have a friend who is using the POP3 proxy for his mail. He has a
> > 10:1 spam:ham imbalance, and he's found that he gets quite a high
> > proportion of unsures (from 200 or so mails a day, over 75% of which
> > are spam). His DB contains about 1300 spam and 150 spam.
> 
> 150 ham, right (you said "spam" twice there)?  That's not much of a ham
> sample regardless of option settings, and without knowing what he set his
> unsure range to, "quite a high proportion" may be astonishing or
> inescapable.

[...]

> > My feeling is that the higher proportion of unsures, plus the
> > unresponsiveness to training, makes it an overall loss.
> 
> The fellow you're talking about has a pathologically low number of ham;

Hmm. I was a little worried about that possibility. The trouble is, it's
a very similar situation to the one I'm in. I get virtually *no* ham
(excluding mailing lists, which are filtered off before the email program
sees them), but ridiculous amounts of spam (hundreds per day). I'd
ignore email totally, if it wasn't for the fact that the few ham I do get
are fairly important.

I don't have any way of training on more ham - I train on it all.

My current approach (which is working reasonably well) is to train on
ham and unsures only, until I get good results, then stop *totally*.
This has left me with a database containing 40-odd ham, and 150 spam.
My unsure rate is tolerable, so I accept that I'm not going to do
any better.

I'm close to going for the other option - get a new mail account :-(

> > Am I right in thinking that pop3proxy has this parameter set to true?
> 
> I don't see anything to suggest that it is.  The default is still False, and
> AFAICT only Outlook2000/default_bayes_customize.ini sets it True.

You're right. I misremembered, and couldn't find the default value.

> It's a pick-your-poison thing.  If you have more spam than ham and keep this
> False, a higher false positive rate is the expected result (or a higher FN
> rate if you have more ham than spam).

(Thinks) OK, I see this.

> It remains experimental because the evidence was/is spotty and mixed.

Yes, that was partly my point. As I understand things (I came into this
after the extensive testing work had pretty much died down) it has become
pretty much impossible to see significant test results now, thanks to the
level of effectiveness which has been achieved.

What I see now is much more of a "real life gut feel" type of effect, which
is nearly impossible to either quantify, or to reproduce reliably. Whether
such evidence is useful is a difficult judgement call :-(

> > My friend has now purged his database and is starting from scratch,
> > to try to improve his results.
> 
> He should have kept the 150 most-recent spam instead.

Good point. But getting 150 new spam isn't exactly a long-term job :-(

> Nope!  We don't store spamprobs in a database, just word counts.
> experimental_ham_spam_imbalance_adjustment is used (only) in
> Classifier.probability() when a probability is (dynamically) computed.

Oh. That's good news. I could (and probably should) do some real tests, then.
(It's much easier if I don't need to retrain).
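
To make that concrete, here is a minimal sketch of a per-word probability computed on the fly from stored counts. It is not the actual Spambayes source. The function name `word_spamprob`, its parameters, and the default values 0.45 and 0.5 are assumptions made for illustration. So is the mechanism shown for the adjustment, which discounts counts from the over-represented class. The only things taken from the message above are the option name and the fact that it is consulted when a probability is computed.

```python
def word_spamprob(spamcount, hamcount, nspam, nham,
                  adjust_imbalance=False,
                  unknown_word_strength=0.45,   # assumed value
                  unknown_word_prob=0.5):       # assumed value
    """Illustrative per-word spam probability from raw counts.

    spamcount / hamcount: trained messages of each kind containing the word.
    nspam / nham: total trained messages of each kind.
    adjust_imbalance: hypothetical stand-in for
        experimental_ham_spam_imbalance_adjustment.
    """
    nspam = float(nspam or 1)
    nham = float(nham or 1)

    if spamcount == 0 and hamcount == 0:
        # Never-seen word: nothing to go on but the prior.
        return unknown_word_prob

    # Normalise by corpus size so that a large spam corpus does not make
    # every word look spammy.
    spamratio = spamcount / nspam
    hamratio = hamcount / nham
    prob = spamratio / (hamratio + spamratio)

    if adjust_imbalance:
        # Assumed mechanism: scale down the evidence contributed by the
        # over-represented class, so its counts are trusted less.
        spam2ham = min(nspam / nham, 1.0)
        ham2spam = min(nham / nspam, 1.0)
    else:
        spam2ham = ham2spam = 1.0

    # Effective number of observations backing this word.
    n = hamcount * spam2ham + spamcount * ham2spam

    # Blend the observed probability with the prior; a small n keeps the
    # result close to unknown_word_prob.
    s = unknown_word_strength
    return (s * unknown_word_prob + n * prob) / (s + n)


if __name__ == "__main__":
    # A database shaped like the one discussed: 1300 spam, 150 ham.
    # The word appears in 10 spam and in no ham.
    print(word_spamprob(10, 0, 1300, 150))
    print(word_spamprob(10, 0, 1300, 150, adjust_imbalance=True))
```

Because only the counts are stored, the same database can be scored with the flag on or off without retraining. In this sketch, switching it on pulls the example word's score toward 0.5, which fits the trade-off described above: fewer confident mistakes, more unsures.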

> > Maybe the option should be exposed in the UI (but that may not be
> > sensible if changing it *does* require a retrain).
> 
> For researchers that would be fine, but end users don't have a clue about
> what to do with exotic internal options.

Hmm. I think I could explain this in end-user language. How does this sound:

    Compensate for unequal numbers of spam and ham
    ----------------------------------------------

    If your training database has significantly more (say, 5 times) ham than
    spam, or vice versa, you may start seeing an increase in incorrect
    classifications (messages put in the wrong category, not just marked
    as unsure). If so, this option allows you to compensate for this, at
    the cost of increasing the number of messages classified as "unsure".

    Note that the effect is subtle, and you should experiment with both
    settings to choose the option that suits you best.

This is always going to be an "advanced" option, so I don't see the longer
explanation as a bad thing...

> Your friend could spend his time better by collecting more ham <0.9 wink>.

It's a shame, nobody sends him any. We're both sad, unloved people :-)

> Since mass testing here stopped, we haven't got useful feedback on any of
> the non-default options.  Since there wasn't enough info to decide about
> them when mass testing stopped, they still deserve a chance to survive.  I
> hope mass testing resumes, but I can't drive it (no time).  Until it does
> resume, the continued existence of these options seems appropriate.

Fair enough. I agree about testing, but I also don't have the time to do a
good job (or the understanding, or the large corpus of data...)

Spambayes is a victim of its own success. Theoretically, it's still only
alpha, but we're getting a real live user base, support issues, the lot.
I'm not sure whether to blame Microsoft for getting people used to the idea
that alpha is as good as it gets, or the Greeks for not having any letters
before alpha :-)

Thanks for taking the time to explain all this.

Paul.


