[Spambayes] Experimental Ham/Spam imbalance setting

Tim Peters tim_one at email.msn.com
Fri May 23 22:46:53 EDT 2003


[Tim]
>> The fellow you're talking about has a pathologically low number of
>> ham;

[Moore, Paul]
> Hmm. I was a little worried about that possibility. The trouble is,
> it's a very similar situation to the one I'm in. I get virtually *no*
> ham (excluding mailing lists, which are filtered off before the email
> program sees them), but ridiculous amounts of spam (hundreds per
> day). I'd ignore email totally, if it wasn't for the fact that the
> few ham I do get are fairly important.
>
> I don't have any way of training on more ham - I train on it all.
>
> My current approach (which is working reasonably well) is to train on
> ham and unsures only, until I get good results, then stop *totally*.
> This has left me with a database containing 40-odd ham, and 150 spam.
> My unsure rate is tolerable, so I accept that I'm not going to do
> any better.

I don't know.  This project never did research on sub-200 msg databases, or
on highly skewed databases until late in the game.  My gut feeling is that
the sample size is indeed too small for a statistical approach to do a great
job.  Still, you *can* fiddle the ham and spam cutoffs to try to compensate,
and there are a lot of other options to fiddle too.  You'd be doing new
research, of course.  In the archives you'll find reports of experiments on
fancier schemes using word n-grams (for n > 1), and I expect they could help
a lot:  word n-gram schemes certainly learn faster (fewer training msgs are
needed to get comparable results).  They didn't get pursued here because
they ran slower, needed more memory and bigger databases, and in
head-to-head tests on our typically much-larger-than-200-msg test sets
didn't do better than the highly tuned unigram scheme this project is still
using.
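For the curious, extracting word n-gram features is only a few lines of code.  The sketch below is a deliberately crude stand-in (the real Spambayes tokenizer does far more: header analysis, URL special-casing, and so on), just to show what "word bigram" means as a feature:

```python
def tokens(text):
    # Crude stand-in for the real Spambayes tokenizer, which does far
    # more (header analysis, URL special-casing, etc.).
    return text.lower().split()

def word_ngrams(words, n):
    # Join each run of n adjacent words into one feature token.
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

words = tokens("click here for your free prize")
print(word_ngrams(words, 1))  # unigrams: ['click', 'here', ...]
print(word_ngrams(words, 2))  # bigrams: ['click here', 'here for', ...]
```

Bigrams like "free prize" carry evidence that the two unigrams "free" and "prize" don't, which is part of why n-gram schemes learn from fewer messages.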

> I'm close to going for the other option - get a new mail account :-(

It's hard to imagine what good you're getting out of this one <wink>.

...

>> It remains experimental because the evidence was/is spotty and mixed.

> Yes, that was partly my point. As I understand things (I came into
> this after the extensive testing work had pretty much died down) it
> has become pretty much impossible to see significant test results
> now, thanks to the level of effectiveness which has been achieved.

It became flatly impossible to make any improvements on my main 50,000+ msg
test database -- there were no false negatives remaining, and the 6-or-so FP
remaining were hopeless.  Those tests were still geared toward my original
purpose, though, seeing whether this technology would work for high-volume
Mailman mailing lists.  All evidence said it would work superbly (but that
still hasn't been done).  What kinds of tweaks may work better for
individual, lower-volume inboxes didn't get nearly as much attention.

> What I see now is much more of a "real life gut feel" type of effect,
> which is nearly impossible to either quantify, or to reproduce
> reliably. Whether such evidence is useful is a difficult judgement
> call :-(

Objective results require a large variety of testers using their real life
inboxes.

> ...
> Hmm. I think I could explain this in end-user language. How does this
> sound:

Made up <wink>.

>     Compensate for unequal numbers of spam and ham
>     ----------------------------------------------
>
>     If your training database has significantly (5 times) more ham
>     than spam, or vice versa, you may start seeing an increase in
>     incorrect classifications

I've seen that a factor of 2 imbalance is enough to trigger surprises.

>     (messages put in the wrong category, not just marked as unsure). If
>     so, this option allows you to compensate for this, at the cost of
>     increasing the number of messages classified as "unsure".

Also at the cost of misclassifying msgs in the *other* direction.  So, e.g.,
setting the option to True is most appropriate if you either (1) have more
spam than ham and fear false positives more than false negatives, or (2)
have more ham than spam and fear false negatives more than false positives.
Enabling the option is expected to increase the Unsure rate in either case.
It's probably not the best way to deal with imbalance either; it's just the
best I could dream up at the time.
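To make the tradeoff concrete, here's the three-way decision the ham and spam cutoffs control.  The cutoff values below are illustrative defaults for this sketch, not tuned recommendations; the point is just that widening the gap between the two cutoffs trades misclassifications for Unsures:

```python
def classify(score, ham_cutoff=0.20, spam_cutoff=0.90):
    # Spambayes-style three-way decision: scores below ham_cutoff are
    # Ham, scores at or above spam_cutoff are Spam, and everything in
    # between lands in Unsure.  Widening the gap between the cutoffs
    # turns would-be misclassifications into Unsures instead.
    if score < ham_cutoff:
        return "Ham"
    if score >= spam_cutoff:
        return "Spam"
    return "Unsure"

print(classify(0.05))   # Ham
print(classify(0.55))   # Unsure
print(classify(0.97))   # Spam
```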

>     Note that the effect is subtle, and you should experiment with
>     both settings to choose the option that suits you best.

Why just this option?  There are *many* options under the covers, and their
effects on inboxes unlike the large and relatively balanced ones most people
tested on simply aren't known.  For example, decreasing unknown_word_strength
may help a lot on small and/or lopsided databases -- or may hurt a lot.  We
simply don't know, since it wasn't tested, and it's easy enough to make up a
*plausibility* argument either way.  Decreasing it will almost certainly
reduce the # of unsures, BTW -- but *probably* at the cost of increasing
misclassification rates.  When there's not much data to go on, it's likely
hard to get a pure win.  BTW, as you boost unknown_word_strength toward
infinity, every msg will tend toward a score of unknown_word_prob (which
defaults to 0.5).
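For concreteness, the adjustment in question is Gary Robinson's smoothing of the raw per-word probabilities, f(w) = (s*x + n*p(w)) / (s + n).  The defaults shown below are from memory of the shipped options; check your own Options before trusting them:

```python
def adjusted_prob(raw_prob, count, s=0.45, x=0.5):
    # Robinson-style smoothing: f(w) = (s*x + n*p(w)) / (s + n), where
    # n (count) is how often the word has been seen, p(w) (raw_prob)
    # is its raw spamminess, s is unknown_word_strength and x is
    # unknown_word_prob.  Rarely seen words are pulled toward x; as s
    # grows toward infinity, *every* word's score tends toward x.
    return (s * x + count * raw_prob) / (s + count)

# A word seen only once is pulled noticeably toward x = 0.5:
print(adjusted_prob(0.99, count=1))          # ~0.838
# A huge s drags even strong evidence all the way to x:
print(adjusted_prob(0.99, count=1, s=1e9))   # ~0.5
```

That limit is exactly the behavior described above: boost unknown_word_strength far enough and every message scores near unknown_word_prob.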

...

>> Since mass testing here stopped, we haven't got useful feedback on
>> any of the non-default options.  Since there wasn't enough info to
>> decide about them when mass testing stopped, they still deserve a
>> chance to survive.  I hope mass testing resumes, but I can't drive
>> it (no time).  Until it does resume, the continued existence of
>> these options seems appropriate.

> Fair enough. I agree about testing, but I also don't have the time to
> do a good job (or the understanding, or the large corpus of data...)
>
> Spambayes is a victim of its own success. Theoretically, it's still
> only alpha, but we're getting a real live user base, support issues,
> the lot. I'm not sure whether to blame Microsoft for getting people
> used to the idea that alpha is as good as it gets, or the Greeks for
> not having any letters before alpha :-)

It's also a victim of economics:  the people who did most of the theoretical
"heavy lifting" (Gary Robinson, Rob Hooft, and me) aren't active here
anymore, and nobody has filled that void yet.  The things that were being
tested when I got yanked from this haven't made any progress, and MarkH's
attempt to get another test round started fizzled out (bless his heart for
trying, though!).  The protocols under which we developed this stuff (see
TESTING.txt) are solid, and when testing stopped there were still more
questions open than had been answered.  Dealing with small and/or lopsided
and/or real-life individual inboxes could be approached the same way, given
someone ruthless enough <wink> to drive it, and enough volunteer testers to
feed it.  It Would Work.

It's tedious and time-consuming work, though.  That's something I learned
from my life in commercial speech recognition:  there's plenty of clever
theory to be exploited, but making it work in real life requires an enormous
investment in data collection, cleaning, tagging and analysis, and
ruthlessness (lack of ego attachment) in letting the data tell you what is
and isn't working.  I suspect that's why no particularly good open source
speech recog program has appeared:  the huge mass of unglamorous grunt work
required doesn't attract volunteers, and most clever ideas get shot down by
the data (it's rarely an ego booster).  We got a good start on playing that
game here because my employer paid my salary to work on it at first, Gary
Robinson was borderline obsessed with dreaming up theoretically clean
foundations, and a wonderful group of testers was attracted enough by the
novelty and promise of it all to play along.

That was great while it lasted.  Deployment is more important over the long
run, but I regret that there's nothing driving the theoretical underpinnings
anymore.  BTW, I have to confess that it works so well on my personal
1,000-msg databases that I've got no incentive to try to make "spare time"
for it anymore -- to the extent spam was chafing me, my itch is thoroughly
scratched.

> Thanks for taking the time to explain all this.

Thanks for listening -- I bet you didn't expect the Spanish Inquisition
<wink>.



