[Python-Dev] The first trustworthy <wink> GBayes results

Tue, 03 Sep 2002 21:06:43 -0400

[Greg Ward]
> ...
> Just how many messages fall in that grey area anyways?

Heh.  Here's the probability distribution for the 4000 ham messages in my
first test pair:

Ham distribution for this pair:
* = 67 items
  0.00 4000 ************************************************************
  2.50    0
  5.00    0
  7.50    0
 10.00    0
 12.50    0
 15.00    0
 17.50    0
 20.00    0
 22.50    0
 25.00    0
 27.50    0
 30.00    0
 32.50    0
 35.00    0
 37.50    0
 40.00    0
 42.50    0
 45.00    0
 47.50    0
 50.00    0
 52.50    0
 55.00    0
 57.50    0
 60.00    0
 62.50    0
 65.00    0
 67.50    0
 70.00    0
 72.50    0
 75.00    0
 77.50    0
 80.00    0
 82.50    0
 85.00    0
 87.50    0
 90.00    0
 92.50    0
 95.00    0
 97.50    0

That is, they *all* got a "probability score" less than 2.5% (0.025).
Here's the spam probability distribution across the same run:

Spam distribution for this pair:
* = 46 items
  0.00    5 *
  2.50    2 *
  5.00    1 *
  7.50    0
 10.00    0
 12.50    0
 15.00    1 *
 17.50    0
 20.00    1 *
 22.50    0
 25.00    2 *
 27.50    1 *
 30.00    0
 32.50    1 *
 35.00    0
 37.50    0
 40.00    0
 42.50    0
 45.00    1 *
 47.50    1 *
 50.00    1 *
 52.50    0
 55.00    0
 57.50    1 *
 60.00    3 *
 62.50    0
 65.00    2 *
 67.50    0
 70.00    0
 72.50    0
 75.00    1 *
 77.50    1 *
 80.00    0
 82.50    0
 85.00    0
 87.50    0
 90.00    3 *
 92.50    1 *
 95.00    6 *
 97.50 2715 ************************************************************

IOW, a spam usually scored at least 0.975 on this run, but some spams scored
under 0.025.  There's very little "in the middle".
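
A minimal sketch of how a display like the ones above could be produced. This is reverse-engineered from the printed output, not taken from the actual test driver: the function name `render_histogram`, the 40 buckets, and the 60-star width are assumptions that happen to reproduce the "* = 46 items" scale shown for the spam run.

```python
def render_histogram(scores, nbuckets=40, width=60):
    """Bucket scores in [0.0, 1.0] and return text lines, one per bucket."""
    counts = [0] * nbuckets
    for score in scores:
        # A score of exactly 1.0 is folded into the last bucket.
        i = min(int(score * nbuckets), nbuckets - 1)
        counts[i] += 1

    # Items per star, rounded up so the tallest bar fits in `width` stars.
    per_star = max(1, -(-max(counts) // width))
    lines = ["* = %d items" % per_star]
    for i, n in enumerate(counts):
        # Ceiling division again, so any nonzero bucket shows at least one star.
        stars = "*" * (-(-n // per_star))
        lines.append(("%6.2f %4d %s" % (100.0 * i / nbuckets, n, stars)).rstrip())
    return lines


if __name__ == "__main__":
    # Hypothetical scores: 2715 near-certain spams and 5 that slipped through.
    for line in render_histogram([0.99] * 2715 + [0.01] * 5):
        print(line)
```

Rounding the star count up rather than down is what makes a bucket holding a single message still visible next to a bucket holding thousands.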

I've got 19 more sets like this if you care a lot <wink>.  Here's the
aggregate across all 20 runs (each msg is counted 4 times here:  once for
each run in which it was in the prediction set, the classifier having been
trained on one of the 4 spam+ham collection pairs it doesn't belong to):

Ham distribution for all runs:
* = 1333 items
  0.00 79938 ************************************************************
  2.50     8 *
  5.00     3 *
  7.50     0
 10.00     3 *
 12.50     1 *
 15.00     3 *
 17.50     1 *
 20.00     1 *
 22.50     0
 25.00     0
 27.50     0
 30.00     1 *
 32.50     4 *
 35.00     2 *
 37.50     0
 40.00     2 *
 42.50     0
 45.00     1 *
 47.50     1 *
 50.00     1 *
 52.50     0
 55.00     0
 57.50     0
 60.00     0
 62.50     1 *
 65.00     0
 67.50     0
 70.00     2 *
 72.50     0
 75.00     1 *
 77.50     1 *
 80.00     0
 82.50     0
 85.00     1 *
 87.50     1 *
 90.00     0
 92.50     1 *
 95.00     1 *
 97.50    21 *

Spam distribution for all runs:
* = 905 items
  0.00   215 *
  2.50    18 *
  5.00     8 *
  7.50    12 *
 10.00     6 *
 12.50     6 *
 15.00    14 *
 17.50     6 *
 20.00    10 *
 22.50     8 *
 25.00     9 *
 27.50     9 *
 30.00     3 *
 32.50     3 *
 35.00     5 *
 37.50     3 *
 40.00     7 *
 42.50    24 *
 45.00     3 *
 47.50    29 *
 50.00    34 *
 52.50     8 *
 55.00     6 *
 57.50    18 *
 60.00    64 *
 62.50    12 *
 65.00     7 *
 67.50     5 *
 70.00     3 *
 72.50     7 *
 75.00     4 *
 77.50    18 *
 80.00    10 *
 82.50    23 *
 85.00    13 *
 87.50    20 *
 90.00    27 *
 92.50    18 *
 95.00    57 *
 97.50 54256 ************************************************************

In percentage terms, very little lives outside the tips of the tail ends.
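
For reference, the 20-run scheme behind the aggregate histograms can be enumerated in a few lines. This is only a sketch: the count of 5 collection pairs is inferred from "20 runs" and "counted 4 times", and the set numbering is an assumption.

```python
from itertools import permutations

NSETS = 5  # assumption: 20 runs, each msg scored 4 times, implies 5 pairs

# Every ordered (train, predict) choice of two different collection pairs.
runs = list(permutations(range(1, NSETS + 1), 2))

# How many runs each pair serves in as the prediction set.
times_predicted = {
    s: sum(1 for _, predict in runs if predict == s)
    for s in range(1, NSETS + 1)
}

print(len(runs))        # 20
print(times_predicted)  # {1: 4, 2: 4, 3: 4, 4: 4, 5: 4}
```

If every prediction set holds 4000 ham, as the first test pair does, 20 runs give 80000 ham scores, which is what the buckets of the aggregate ham histogram add up to.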

Note that calling the spam cutoff 0.975 instead of 0.90 would save 2 false
positives, at the expense of letting an additional 27+18+57 = 102 spams go
thru.
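
The trade-off can be checked straight from the top four buckets of the two aggregate histograms. A small sketch; the dictionary layout and the helper name `flagged` are illustrative choices, while the counts are copied from the tables above.

```python
# Top four buckets of the aggregate histograms: lower bucket edge -> count.
ham_top = {0.900: 0, 0.925: 1, 0.950: 1, 0.975: 21}
spam_top = {0.900: 27, 0.925: 18, 0.950: 57, 0.975: 54256}


def flagged(buckets, cutoff):
    """Count messages in buckets whose lower edge is at or above the cutoff."""
    return sum(n for edge, n in buckets.items() if edge >= cutoff)


# Moving the cutoff from 0.90 up to 0.975:
fp_saved = flagged(ham_top, 0.90) - flagged(ham_top, 0.975)
spam_missed = flagged(spam_top, 0.90) - flagged(spam_top, 0.975)
print(fp_saved, spam_missed)  # 2 102
```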

Here's the first example of a low-prob spam:

"""
Low prob spam! 0.0133104753792
Data/Spam/Set2/8007.txt
prob('from:email name:<janet691') = 0.5
prob('the') = 0.5
prob('subject:Fred') = 0.5
prob('you') = 0.5
prob('was') = 0.305052
prob('bool:noorg') = 0.614515
prob('proposal') = 0.100629
prob('will') = 0.557569
prob('talk') = 0.507463
prob('send') = 0.858078
prob('nice') = 0.227838
prob('from:email addr:ac') = 0.0754717
prob('from:email addr:uk>') = 0.0488301
prob('thanks,') = 0.0300188
prob('subject:Hey') = 0.99
prob('today') = 0.852792

Return-Path: <janet691@cranfield.ac.uk>
Delivered-To: bruce-spam@localhost
Received: (qmail 14409 invoked by alias); 6 Mar 2002 20:07:42 -0000
Delivered-To: spam@bruce-guenter.dyndns.org
Received: (qmail 14405 invoked from network); 6 Mar 2002 20:07:42 -0000
Received: from agamemnon.bfsmedia.com (204.83.201.2)
  by lorien.untroubled.org (192.168.1.3) with SMTP; 06 Mar 2002 20:07:42 -0000
Received: (qmail 13063 invoked by uid 500); 6 Mar 2002 20:02:05 -0000
Delivered-To: em-ca-spam@em.ca
Received: (qmail 13057 invoked by uid 502); 6 Mar 2002 20:02:05 -0000
Delivered-To: bfsmedia-goose.kennels@bfsmedia.com
Received: (qmail 13051 invoked from network); 6 Mar 2002 20:02:05 -0000
Received: from unknown (HELO smtp2.forserve.com) (63.170.11.221)
  by agamemnon.bfsmedia.com with SMTP; 6 Mar 2002 20:02:05 -0000
Date: Wed, 6 Mar 2002 15:12:41 -0500
Message-Id: <200203062012.g26KCfn08192@smtp2.forserve.com>
X-Mailer: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:0.9.1) Gecko/20010607
Reply-To: <janet691@cranfield.ac.uk>
From: <janet691@cranfield.ac.uk>
To: <goose01977@bellsouth.net>
Subject: Hey Fred
Content-Length: 95
Lines: 9

Fred,

  It was nice to talk to you today I will send the proposal tonight.

Thanks,
 Heidi
"""

You figure it out <wink>.  I suspect bfsmedia would have added a high spam
score if I looked at Received lines, but even several additional strong spam
indicators wouldn't be enough to nail this one.  BTW, this msg shows up many
times in the spam corpora, varying the "Fred" and "Heidi" with other male
and female names; I assume this is a harvester that's trying to provoke the
recipient into replying.

Several others are damaged in ways such that the email pkg can't create a
msg out of them.  I could easily enough add code to force such a msg to be
considered spam.
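
One way such a rule might look, as a sketch only. `score_message` and the `classify` callable are hypothetical names standing in for the real scoring code, and the `parse` parameter exists just so the failure path can be exercised. Note that current versions of the email package tend to record defects on the message rather than raise, so a real check might need to look at more than exceptions.

```python
import email
import email.errors


def score_message(text, classify, parse=email.message_from_string):
    """Return a spam score in [0.0, 1.0]; unparseable input scores 1.0.

    `classify` is a hypothetical callable that takes a parsed message and
    returns its spam probability.
    """
    try:
        msg = parse(text)
    except email.errors.MessageError:
        # The email package could not build a message: force it to spam.
        return 1.0
    return classify(msg)
```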

Some are wildly embarrassing failures:

"""
Low prob spam! 0.000102019995919
Data/Spam/Set3/681.txt
prob('common,') = 0.01
prob('definately') = 0.01
prob('logic') = 0.01
prob('hell,') = 0.01
prob('it".') = 0.01
prob('obvious.') = 0.01
prob('theory') = 0.01
prob('whilst') = 0.01
prob('earning') = 0.99
prob('same,') = 0.01
prob('$500,000') = 0.99
prob('"bull",') = 0.99
prob('year!!!') = 0.99
prob('internet!') = 0.99
prob('tv:') = 0.99
prob('*this') = 0.99

Return-Path: <ihrockrat3213@hotmail.com>
Delivered-To: em-ca-bruceg@em.ca
Received: (qmail 25721 invoked from network); 17 Aug 2002 01:05:07 -0000
Received: from unknown (HELO 65.102.48.161) (65.102.48.161)
  by churchill.factcomp.com with SMTP; 17 Aug 2002 01:05:07 -0000
Received: from unknown (149.89.93.47) by rly-xr02.mx.aol.com with NNFMP;
  Aug, 17 2002 1:50:22 AM -0800
Received: from anther.webhostingtalk.com ([88.58.121.118]) by
  da001d2020.lax-ca.osd.concentric.net with QMQP; Aug, 17 2002 12:40:13 AM -0700
Received: from 34.57.158.148 ([34.57.158.148]) by rly-xr02.mx.aol.com with
  local; Aug, 17 2002 12:02:05 AM +0300
From: rnpyjohn <ihrockrat3213@hotmail.com>
To: Undisclosed Recipients
Cc:
Subject: Please read this letter carefully, it works 100%
Sender: rnpyjohn <ihrockrat3213@hotmail.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="iso-8859-1"
Date: Sat, 17 Aug 2002 02:03:28 +0100
X-Mailer: The Bat! (v1.52f) Business
X-Priority: 1
Content-Length: 15985

*This is a one time mailing and this list will never be used again.*

Hi,

SEEN THIS MAIL BEFORE?,  SICK OF FINDING IT IN YOUR INBOX?   ME TOO, HONEST
I was exactly the same, till one day whilst i was complaining about how
tired i was of seeing ...
"""

The 16 most extreme indicators are split 9 highly in favor of ham
(.01) and 7 highly in favor of spam (.99).  If I hadn't folded case away to
let stinking conference announcements through <wink>, I expect it would have
latched on to the SCREAMING at the start instead of looking deeper.  Looking
at the To: line probably would nail this one too, as "Undisclosed
Recipients" has two 0.99 spam indicators right there.

Whatever, you *don't* want to look at msgs with a mix of just 0.99 and 0.01
thingies:  it's not all that unusual to get such an extreme mix, in spam or
ham.
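
The post doesn't show how the 16 clue probabilities are combined, but both scores quoted above are consistent with the Graham-style rule P = prod(p) / (prod(p) + prod(1-p)). A minimal sketch under that assumption; the function name `combine` is illustrative.

```python
def combine(probs):
    """Graham-style combination of per-clue spam probabilities."""
    spam_product = 1.0
    ham_product = 1.0
    for p in probs:
        spam_product *= p
        ham_product *= 1.0 - p
    return spam_product / (spam_product + ham_product)


# The second low-prob spam: 9 clues at 0.01 and 7 clues at 0.99.
clues = [0.01] * 9 + [0.99] * 7
print(combine(clues))  # about 0.000102
```

Under this rule each 0.99 clue cancels one 0.01 clue, so only the surplus matters: two leftover 0.01 clues give odds of 1 to 99*99 = 9801, i.e. a score of 1/9802. That is how a 9-to-7 split ends up looking like near-certain ham.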

this-isn't-your-father's-idea-of-probability<wink>-ly y'rs  - tim