[Spambayes] understanding high false negative rate

Jeremy Hylton jeremy@alum.mit.edu
Fri, 6 Sep 2002 16:03:28 -0400


I've done some testing with personal collections of ham and spam.  I'm
seeing very high false negative rates; 20-30% is typical.  The false
positive rate is 0-3%.  (Finally!  I had to scrub a bunch of previously
unnoticed spam from my inbox.)  Each collection has about 1100
messages.

I'd like to figure out why my false negative rate is so high, but I'm
not sure which details I should look at to diagnose the problem.  I'm
assuming that mboxtest.py is basically correct, but it could have bugs.

One possibility is that my ham test set isn't nearly so useful as the
python-list, since it isn't focused on a single topic.  I've got some
python email, personal correspondence, questions about my Shakespeare
web site, and a few email newsletters I get on a regular basis.  I've
got receipts from various online order sites, mail from the company
that manages my student loans, etc.  Maybe the great variety in my
non-spam email makes it harder to find good discriminators for spam?

Here's a sample spam distribution from a test run:

Spam distribution for this pair:
* = 3 items
  0.00  73 *************************
  2.50   0 
  5.00   2 *
  7.50   0 
 10.00   0 
 12.50   1 *
 15.00   0 
 17.50   1 *
 20.00   1 *
 22.50   0 
 25.00   2 *
 27.50   0 
 30.00   0 
 32.50   0 
 35.00   0 
 37.50   0 
 40.00   0 
 42.50   0 
 45.00   0 
 47.50   0 
 50.00   0 
 52.50   0 
 55.00   0 
 57.50   1 *
 60.00   0 
 62.50   1 *
 65.00   0 
 67.50   0 
 70.00   1 *
 72.50   0 
 75.00   0 
 77.50   0 
 80.00   2 *
 82.50   2 *
 85.00   2 *
 87.50   0 
 90.00   4 **
 92.50   1 *
 95.00   5 **
 97.50 127 *******************************************
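
To turn a histogram like this into a false negative rate, it is enough
to add up the buckets below the spam cutoff.  A minimal sketch follows;
the counts are copied from the distribution above, but the two cutoffs
(50 and 90) are hypothetical values chosen for illustration, not taken
from mboxtest.py.

```python
# Non-empty buckets from the histogram above: (lower edge in percent, count).
buckets = [
    (0.0, 73), (5.0, 2), (12.5, 1), (17.5, 1), (20.0, 1), (25.0, 2),
    (57.5, 1), (62.5, 1), (70.0, 1), (80.0, 2), (82.5, 2), (85.0, 2),
    (90.0, 4), (92.5, 1), (95.0, 5), (97.5, 127),
]


def false_negative_rate(buckets, cutoff):
    """Fraction of spam whose bucket starts below `cutoff` (in percent)."""
    total = sum(count for _, count in buckets)
    missed = sum(count for edge, count in buckets if edge < cutoff)
    return missed / total


total = sum(count for _, count in buckets)
for cutoff in (50.0, 90.0):  # hypothetical cutoffs, for illustration only
    rate = false_negative_rate(buckets, cutoff)
    print(f"cutoff {cutoff:5.1f}: {rate:.1%} of {total} spam scored below it")
```

Almost all of the misses sit in the lowest bucket, so moving the cutoff
changes the rate only slightly for this run.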

And here's a sample false negative.  (I'll quote the report so it
stands out.)  One thing I don't understand is how the spam probability
for the message is so low, when there are several high indicators and
few low indicators.

> Low prob spam! 1.64654685184e-11
> /home/jeremy/Mail/spam:242 subject: your web site has been mapped
> prob('millions') = 0.99
> prob('skip:= 40') = 0.99
> prob('"remove"') = 0.99
> prob('from:email addr:mail') = 0.99
> prob('email addr:alum') = 0.01
> prob('status') = 0.01
> prob('connected') = 0.01
> prob('returning') = 0.01
> prob('from:email addr:com>') = 0.224056
> prob('every') = 0.789741
> prob('charges') = 0.208406
> prob('free') = 0.818103
> prob('survey.') = 0.14931
> prob('officer') = 0.208406
> prob('its') = 0.155044
> prob('added') = 0.133131
> prob('current') = 0.152639
> prob('email addr:mit') = 0.01
> prob('wide') = 0.0911528
> prob('mark') = 0.136416
> prob('survey') = 0.0850202
> prob('http1:asp') = 0.88055
> prob("i'd") = 0.0470418
> prob('notices') = 0.01
> 
> From VM Mon Jul 24 10:05:39 2000
> Return-Path: <undeliverables@mail.internetseer.com>
> Message-ID: <0112a1010021870MARS1@mars1.internetseer.com>
> Status: RO
> From: "InternetSeer.com" <services@mail.internetseer.com>
> To: jeremy@alum.mit.edu
> Subject: Your web site has been mapped
> Date: 23 Jul 2000 22:10:11 -0400
> 
> Freewire has added your web site to its map of the World Wide Web.  Freewire will continue to monitor millions of links and web sites every day during its ongoing web survey.
> 
> If it is important for you to know that your site is connected to the web at all times, Freewire has arranged with InternetSeer.com to notify you when your site does not respond.  This means that, AT NO CHARGE; InternetSeer.com will monitor your Web site every hour and send notification to you by email whenever your site is not connected to the Web. There are NO current or future charges associated with this service.
> 
> To begin your FREE monitoring NOW, activate your account at:
> http://www.internetseer.com/signup.asp?email=jeremy@alum.mit.edu
> 
> Mark McLellan
> Chief Technology Officer
> Freewire.com
> 
> Is your web site status important to you? I'd love your comments. If you prefer not to receive any future notices that result from our ongoing survey please let me know by returning this email with the word "remove" in the subject line.
> 
> ==============================================
> ##Remove: jeremy@alum.mit.edu##
> 
> 
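
One way to probe the low score is to recombine the listed clues by
hand.  The sketch below assumes a Graham-style combining rule,
P = prod(p) / (prod(p) + prod(1 - p)), applied to every clue in the
report; that rule is an assumption here and has not been checked
against the classifier code.  The probabilities are copied from the
quoted report.

```python
import math

# Clue probabilities copied from the quoted report above.
clues = {
    "millions": 0.99,
    "skip:= 40": 0.99,
    '"remove"': 0.99,
    "from:email addr:mail": 0.99,
    "email addr:alum": 0.01,
    "status": 0.01,
    "connected": 0.01,
    "returning": 0.01,
    "from:email addr:com>": 0.224056,
    "every": 0.789741,
    "charges": 0.208406,
    "free": 0.818103,
    "survey.": 0.14931,
    "officer": 0.208406,
    "its": 0.155044,
    "added": 0.133131,
    "current": 0.152639,
    "email addr:mit": 0.01,
    "wide": 0.0911528,
    "mark": 0.136416,
    "survey": 0.0850202,
    "http1:asp": 0.88055,
    "i'd": 0.0470418,
    "notices": 0.01,
}


def combine(probs):
    """Assumed Graham-style rule: prod(p) / (prod(p) + prod(1 - p))."""
    spam = math.prod(probs)
    ham = math.prod(1.0 - p for p in probs)
    return spam / (spam + ham)


probs = list(clues.values())
score = combine(probs)
strong_spam = sum(1 for p in probs if p >= 0.99)
strong_ham = sum(1 for p in probs if p <= 0.01)
print(f"combined score: {score:.3g}")
print(f"clues at 0.99: {strong_spam}, clues at 0.01: {strong_ham}")
```

Under that assumption the result lands on the order of 1e-11, in line
with the reported figure.  The list holds six clues pinned at 0.01
against four at 0.99, and most of the mid-range clues lean toward ham,
so the product of the (1 - p) terms dominates.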

Jeremy