[Spambayes] Large false negative ...

Skip Montanaro skip@pobox.com
Sat, 14 Sep 2002 07:56:46 -0500


    >> Here's the final summary chunk from the rates.py output:

    Tim> There's not enough here, Skip.  rates.py prints more stuff than
    Tim> that, and does so because it's important information.  

Okay, I've tried to mimic your settings and report more information below.
Here are my latest settings:

    [TestDriver]
    save_trained_pickles = False
    show_histograms = True
    show_ham_lo = 1.0
    show_best_discriminators = 50
    show_spam_lo = 1.0
    show_ham_hi = 0.0
    show_false_positives = True
    pickle_basename = class
    show_false_negatives = True
    nbuckets = 40
    show_charlimit = 3000
    show_spam_hi = 0.0

    [Classifier]
    spambias = 1.0
    min_spamprob = 0.01
    unknown_spamprob = 0.5
    hambias = 2.0
    max_discriminators = 16
    max_spamprob = 0.99
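
For anyone puzzling over what those [Classifier] knobs do, here's a rough
standalone sketch (not the actual SpamBayes code) of how they constrain a
single token's spam probability, Graham-style:

```python
# Sketch only: how the [Classifier] settings above shape a per-token
# spam probability.  hambias doubles ham counts; min/max clamp the
# extremes; unknown_spamprob is used for never-seen tokens.
MIN_SPAMPROB = 0.01      # min_spamprob
MAX_SPAMPROB = 0.99      # max_spamprob
UNKNOWN_SPAMPROB = 0.5   # unknown_spamprob
HAMBIAS = 2.0            # hambias
SPAMBIAS = 1.0           # spambias

def spamprob(hamcount, spamcount, nham, nspam):
    """Per-token P(spam | token), clamped to [min_spamprob, max_spamprob]."""
    if hamcount == 0 and spamcount == 0:
        return UNKNOWN_SPAMPROB
    hamratio = HAMBIAS * hamcount / nham
    spamratio = SPAMBIAS * spamcount / nspam
    prob = spamratio / (hamratio + spamratio)
    return min(MAX_SPAMPROB, max(MIN_SPAMPROB, prob))

print(spamprob(0, 40, 1600, 1300))   # spam-only token: clamped to 0.99
print(spamprob(40, 0, 1600, 1300))   # ham-only token: clamped to 0.01
```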

    [Tokenizer]
    safe_headers = abuse-reports-to
            date
            errors-to
            from
            importance
            in-reply-to
            message-id
            mime-version
            organization
            received
            reply-to
            return-path
            subject
            to
            user-agent
            x-abuse-info
            x-complaints-to
            x-face
    mine_received_headers = False
    retain_pure_html_tags = False
    count_all_header_lines = False


    Tim> Note that I'm using a factor of about 45x more training data.

    >> I'd be happy with 3-4% fn.

    Tim> Try using 50x more data <wink>.

I would have thought that I'd be within spitting distance of the sorts of
numbers you reported with the 5x5 scheme.  I'm training on 1600 hams and
1300 spams.  You were training on 4000 hams and 2750 spams, I believe.

    >> On a somewhat brighter note, I'm quite happy with the fp percentage...

    Tim> You shouldn't be, though -- the extreme imbalance in rates is as
    Tim> suspicious as the absolute magnitude of the fn rate.  It's as if
    Tim> all your ham have something trivial in common that the classifier
    Tim> is latching onto (wouldn't be the first time someone got tripped up
    Tim> by this!), and that a fair amount of your spam also has that.
    Tim> Looking at the best discriminators may reveal something of this
    Tim> nature.

Here's the set of best discriminators from my latest run:

        'x-mailer:none' 96 0.481519
        'header:MIME-Version:1' 97 0.282956
        'subject:GB2312' 97 0.99
        'idaho' 100 0.01
        '8bit%:92' 101 0.99
        'url:dolphin' 102 0.01
        'url:listinfo' 102 0.0329182
        'felt' 104 0.01
        'header:Subject:1' 104 0.497099
        'kept)' 104 0.01
        '(text' 105 0.01
        'cedu.' 109 0.01
        'charset:gb2312' 111 0.99
        'email addr:dolphin.mojam.com' 112 0.01
        'stripmime' 112 0.01
        'subject:CEDU' 112 0.01
        'header:Message-ID:1' 116 0.245064
        '[cedu]' 117 0.01
        'header:Errors-To:1' 117 0.0915332
        'jan' 117 0.01
        'content-type:text/plain' 118 0.405034
        'header:User-Agent:1' 118 0.01
        'workshop' 120 0.01
        'daughter' 123 0.01
        'venue' 123 0.01
        'rma' 124 0.01
        'folk' 125 0.01
        'header:To:1' 126 0.495146
        'from:email addr:aol.com' 135 0.0148768
        'para' 137 0.99
        'skip' 141 0.0122271
        'concert' 142 0.01
        'header:Message-Id:1' 146 0.60187
        'header:Received:1' 156 0.964159
        'parents' 158 0.0109178
        'ascent' 161 0.01
        'parent' 180 0.01
        '8bit%:100' 181 0.99
        'bca' 192 0.01
        'wrote:' 202 0.01
        'sent:' 213 0.01
        'montanaro' 215 0.01
        'header:In-Reply-To:1' 245 0.01
        'kids' 250 0.01
        'url:manatee' 485 0.01
        'email addr:manatee.mojam.com' 526 0.01
        'url:cedu' 570 0.01
        'email name:cedu' 600 0.01
        'subject:cedu' 685 0.01
        'cedu' 694 0.01

A few things jump out at me looking at this list:

 * There are very few 0.99's.  I would have thought some of the obvious spam
   tripwords would have made it into the set.

 * It's a bit suspicious that "dolphin" and "dolphin.mojam.com" are such
   good ham indicators.  I haven't used dolphin as my email machine for
   quite a while (a couple of years at least).  In fact, it's been turned
   off for the past year or so.

 * There are lots of 0.01's associated with "cedu" ("workshop", "daughter",
   "[cedu]", "idaho", etc).  I participate in three pretty disjoint
   communities on the net: Python, CEDU and old British cars.  Note that
   there are no Python terms in the discriminator set (not even "Guido" or
   "Tim").  I have a ton of unread CEDU mail I dumped into the training set,
   but have very little Python data saved.  Can the training data be
   overwhelmed by too much input from one population?
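
To make that last worry concrete, here's a toy sketch (invented counts, and
a Graham-style per-token estimate with the hambias=2.0 setting above) of how
a single dominant ham community can flood the discriminator list with 0.01's
while an underrepresented community contributes nothing:

```python
# Toy illustration of one ham population dominating the training data.
from collections import Counter

ham_msgs = (["cedu workshop parents"] * 90     # dominant community
            + ["python guido patch"] * 10)     # underrepresented one
spam_msgs = ["viagra offer cedu"] * 50         # some spam also says "cedu"

def token_counts(msgs):
    c = Counter()
    for m in msgs:
        c.update(m.split())
    return c

ham, spam = token_counts(ham_msgs), token_counts(spam_msgs)

def spamprob(tok, nham=len(ham_msgs), nspam=len(spam_msgs)):
    # Graham-style estimate with ham counts doubled (hambias = 2.0).
    h = 2.0 * ham[tok] / nham
    s = spam[tok] / nspam
    if h + s == 0:
        return 0.5
    return min(0.99, max(0.01, s / (h + s)))

for tok in ("cedu", "workshop", "python"):
    print(tok, round(spamprob(tok), 3))
# cedu 0.357 -- pulled toward ham even though *every* spam contains it
# workshop 0.01
# python 0.01
```

The dominant community's vocabulary ends up wall-to-wall 0.01, and its
overlap with spam ("cedu" here) gets dragged toward ham too.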

    Tim> Is it the case that you're only running 5-fold c-v?

Yes.  400 hams and (now down to) 325 spams per Set[1-5] directory.

    Tim> For those with a larger corpus, I suggest it's easier to fiddle
    Tim> MsgStream.produce to pick smaller subsets at random; e.g.,

    Tim>     def produce(self):
    Tim>         import os, random
    Tim>         keep = 'Spam' in self.directories[0] and 328 or 400
    Tim>         for directory in self.directories:
    Tim>             all = os.listdir(directory)
    Tim>             random.seed(hash(max(all))) # reproducible across calls
    Tim>             random.shuffle(all)
    Tim>             for fname in all[:keep]:
    Tim>                 yield Msg(directory, fname)

Perhaps the keep values should be settable options?
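
A minimal standalone sketch of what that might look like, pulled out of the
class for illustration (the `ham_keep`/`spam_keep` parameter names are
hypothetical, not existing SpamBayes options):

```python
import os
import random

def produce(directories, ham_keep=400, spam_keep=328):
    """Yield (directory, fname) pairs, keeping at most `keep` messages
    per stream.  ham_keep/spam_keep stand in for the proposed settable
    options; the seeding trick mirrors Tim's snippet so the same random
    subset is chosen every time within a run."""
    keep = spam_keep if 'Spam' in directories[0] else ham_keep
    for directory in directories:
        names = sorted(os.listdir(directory))
        random.seed(hash(max(names)))  # reproducible within a run
        random.shuffle(names)
        for fname in names[:keep]:
            yield directory, fname
```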

    Tim> Here's the summary file:

Here's my latest summary file:

-> Training on Data/Ham/Set2-5 & Data/Spam/Set2-5 ... 1600 hams & 1300 spams
-> Predicting Data/Ham/Set1 & Data/Spam/Set1 ...
-> <stat> tested 400 hams & 325 spams against 1600 hams & 1300 spams
-> <stat> false positive %: 0.0
-> <stat> false negative %: 16.3076923077
      0.000  16.308
-> <stat> 0 new false positives
-> <stat> 53 new false negatives
-> Training on Data/Ham/Set1 & Data/Spam/Set1 ... 400 hams & 325 spams
-> Forgetting Data/Ham/Set2 & Data/Spam/Set2 ... 400 hams & 325 spams
-> Predicting Data/Ham/Set2 & Data/Spam/Set2 ...
-> <stat> tested 400 hams & 325 spams against 1600 hams & 1300 spams
-> <stat> false positive %: 0.0
-> <stat> false negative %: 14.4615384615
      0.000  14.462
-> <stat> 0 new false positives
-> <stat> 47 new false negatives
-> Training on Data/Ham/Set2 & Data/Spam/Set2 ... 400 hams & 325 spams
-> Forgetting Data/Ham/Set3 & Data/Spam/Set3 ... 400 hams & 325 spams
-> Predicting Data/Ham/Set3 & Data/Spam/Set3 ...
-> <stat> tested 400 hams & 325 spams against 1600 hams & 1300 spams
-> <stat> false positive %: 0.0
-> <stat> false negative %: 14.7692307692
      0.000  14.769
-> <stat> 0 new false positives
-> <stat> 48 new false negatives
-> Training on Data/Ham/Set3 & Data/Spam/Set3 ... 400 hams & 325 spams
-> Forgetting Data/Ham/Set4 & Data/Spam/Set4 ... 400 hams & 325 spams
-> Predicting Data/Ham/Set4 & Data/Spam/Set4 ...
-> <stat> tested 400 hams & 325 spams against 1600 hams & 1300 spams
-> <stat> false positive %: 0.0
-> <stat> false negative %: 15.3846153846
      0.000  15.385
-> <stat> 0 new false positives
-> <stat> 50 new false negatives
-> Training on Data/Ham/Set4 & Data/Spam/Set4 ... 400 hams & 325 spams
-> Forgetting Data/Ham/Set5 & Data/Spam/Set5 ... 400 hams & 325 spams
-> Predicting Data/Ham/Set5 & Data/Spam/Set5 ...
-> <stat> tested 400 hams & 325 spams against 1600 hams & 1300 spams
-> <stat> false positive %: 0.0
-> <stat> false negative %: 15.3846153846
      0.000  15.385
-> <stat> 0 new false positives
-> <stat> 50 new false negatives
total unique false pos 0
total unique false neg 248
average fp % 0.0
average fn % 15.2615384615
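
The totals can be sanity-checked against the per-run numbers above; this is
just arithmetic, nothing more:

```python
# Per-run false-negative counts and rates, copied from the summary above.
fn_counts = [53, 47, 48, 50, 50]
fn_rates = [16.3076923077, 14.4615384615, 14.7692307692,
            15.3846153846, 15.3846153846]
spams_per_run = 325

print(sum(fn_counts))                                # 248 total unique fn
print(sum(fn_rates) / len(fn_rates))                 # average fn % ~15.2615
print(100.0 * sum(fn_counts) / (5 * spams_per_run))  # same figure from counts
```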

    Tim> Here are the score distributions:

And mine:

Ham distribution for all runs:
* = 34 items
  0.00 2000 ***********************************************************
  2.50    0 
  5.00    0 
  7.50    0 
 10.00    0 
 12.50    0 
 15.00    0 
 17.50    0 
 20.00    0 
 22.50    0 
 25.00    0 
 27.50    0 
 30.00    0 
 32.50    0 
 35.00    0 
 37.50    0 
 40.00    0 
 42.50    0 
 45.00    0 
 47.50    0 
 50.00    0 
 52.50    0 
 55.00    0 
 57.50    0 
 60.00    0 
 62.50    0 
 65.00    0 
 67.50    0 
 70.00    0 
 72.50    0 
 75.00    0 
 77.50    0 
 80.00    0 
 82.50    0 
 85.00    0 
 87.50    0 
 90.00    0 
 92.50    0 
 95.00    0 
 97.50    0 

Spam distribution for all runs:
* = 23 items
  0.00  213 **********
  2.50    7 *
  5.00    3 *
  7.50    3 *
 10.00    3 *
 12.50    0 
 15.00    2 *
 17.50    0 
 20.00    1 *
 22.50    2 *
 25.00    0 
 27.50    1 *
 30.00    0 
 32.50    0 
 35.00    0 
 37.50    0 
 40.00    0 
 42.50    0 
 45.00    1 *
 47.50    0 
 50.00    0 
 52.50    2 *
 55.00    0 
 57.50    0 
 60.00    1 *
 62.50    0 
 65.00    2 *
 67.50    0 
 70.00    1 *
 72.50    1 *
 75.00    0 
 77.50    1 *
 80.00    2 *
 82.50    0 
 85.00    1 *
 87.50    1 *
 90.00    2 *
 92.50    4 *
 95.00    1 *
 97.50 1370 ************************************************************

My best guess at this point is that lots of spam goes to ancient addresses
and I've mixed in a fair amount of old (but ham) mail which was sent to such
old addresses.  Perhaps deleting old ham from the corpus will improve
things.  I'll try moving old mail out of the way temporarily and see how
that changes the results.

Skip