[Spambayes] My first run (long)

Brad Clements bkc@murkworks.com
Sun, 22 Sep 2002 11:08:12 -0400


Question: My messages seem to have \r\n line endings. Does that matter, or should I strip them?
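
In case stripping turns out to be the answer, here is a minimal sketch of normalizing a message before it goes into the corpus. It assumes each message is read as raw bytes; normalize_line_endings is a hypothetical helper of my own, not something in the spambayes code.

```python
def normalize_line_endings(raw: bytes) -> bytes:
    """Convert CRLF pairs and any leftover lone CR characters to LF."""
    # Replace CRLF first so a pair does not turn into two newlines.
    return raw.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


sample = b"Subject: test\r\n\r\nHello\r\n"
print(normalize_line_endings(sample))
```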


Description of my corpus:

I received all of this email directly: through my primary email address, through 
various aliases (such as info@murkworks.com), or as mail addressed to former 
employees.

Spam has been collected over the past year or two. Although I have over 33,000 spam 
messages, I took only the most recent 13,000 to balance my ham corpus.

Ham messages are those currently in my inbox: messages requesting tech support for 
our various products, messages to mailing lists such as FTP, NTP, and SMTP protocol 
discussion lists, various IETF lists, "commercial-like" email from Ingram Micro, Novell, 
and HP, confirmation messages from any company I purchased software from, etc. Some 
ham is rather old, from 1995 or so.


Using a CVS update from 2 hours ago, with these options:

[Classifier]
use_robinson_probability: True
use_robinson_combining: True
max_discriminators: 1500
robinson_minimum_prob_strength: 0.1
[TestDriver]
spam_cutoff: 0.575
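
To make the cutoff concrete, here is a small sketch that parses the fragment above with the standard-library configparser and applies spam_cutoff to a score. This is only an illustration, not how the spambayes test driver reads its options; is_spam is a hypothetical helper, and treating scores strictly above the cutoff as spam is an assumption.

```python
import configparser

# The option fragment used for this run, copied verbatim.
OPTIONS = """\
[Classifier]
use_robinson_probability: True
use_robinson_combining: True
max_discriminators: 1500
robinson_minimum_prob_strength: 0.1
[TestDriver]
spam_cutoff: 0.575
"""

parser = configparser.ConfigParser()
parser.read_string(OPTIONS)
spam_cutoff = parser.getfloat("TestDriver", "spam_cutoff")


def is_spam(score, cutoff=spam_cutoff):
    # Assumption: a score strictly above the cutoff is called spam.
    return score > cutoff


print(spam_cutoff, is_spam(0.58), is_spam(0.30))
```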

Checking the false negatives, I see a few spams that were bounced to postmaster.

My false positives all seem to be "marketing" type stuff from Ingram-Micro, Novell, HP, 
E-Trade, etc.

I need to tune up my corpus...

Ham distribution for all runs:
13000 items; mean 30.27; sample sdev 7.58
* = 34 items
  0.00    0
  2.50    7 *
  5.00    7 *
  7.50   13 *
 10.00   54 **
 12.50   87 ***
 15.00  217 *******
 17.50  369 ***********
 20.00  798 ************************ 
 22.50 1367 *****************************************
 25.00 1949 **********************************************************
 27.50 2029 ************************************************************
 30.00 1861 *******************************************************
 32.50 1378 *****************************************
 35.00  916 ***************************
 37.50  733 **********************
 40.00  462 **************
 42.50  219 *******
 45.00  162 *****
 47.50  138 *****
 50.00   75 ***
 52.50   61 **
 55.00   35 **
 57.50   30 *
 60.00   12 *
 62.50    9 *
 65.00    4 *
 67.50    3 *
 70.00    0
 72.50    3 *
 75.00    2 * <- is spam
 77.50    0
 80.00    0

The hams scoring around 0.625 (and so flagged as spam) look like this:

subject: BRAD CLEMENTS, it's time to renew your subscription to EDN - Second
	Notice

(hey, that really is ham)

A message from my mother scored 0.58, and so did one from Borland with:

> Time is running out - Save 15% on New Delphi 7

Verisign/Netsol scores 0.589

>  Confirmation of murkworks.com renewal order

United Airlines, 0.67

> Welcome to United Connection



Spam distribution for all runs:
13000 items; mean 79.08; sample sdev 8.12
* = 38 items
  0.00    0
  2.50    0
  5.00    0
  7.50    0
 10.00    0
 12.50    0
 15.00    0
 17.50    2 *
 20.00    0
 22.50    2 *
 25.00    1 *
 27.50    2 *
 30.00    2 *
 32.50    2 *
 35.00    4 *
 37.50   49 **
 40.00   27 *
 42.50    6 *
 45.00   16 *
 47.50   19 *
 50.00   23 *
 52.50   33 *
 55.00   67 **
 57.50   92 ***
 60.00  169 *****
 62.50  232 *******
 65.00  349 **********
 67.50  502 **************
 70.00  695 *******************
 72.50  714 *******************
 75.00 1149 *******************************
 77.50 1572 ******************************************
 80.00 2179 **********************************************************
 82.50 2271 ************************************************************
 85.00 1722 **********************************************
 87.50  816 **********************
 90.00  278 ********
 92.50    5 *
 95.00    0
 97.50    0
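
As a cross-check, the two histograms can be tallied against the cutoff. The sketch below copies the bucket counts from the distributions above and assumes that spam_cutoff of 0.575 corresponds to the 57.50 bucket edge on the histograms' 0-100 scale. split_at_cutoff is a hypothetical helper, and it only works when the cutoff lands exactly on a bucket boundary.

```python
BUCKET_WIDTH = 2.5
CUTOFF = 57.5  # spam_cutoff 0.575, expressed on the histogram's 0-100 scale

# Counts per bucket, starting at the 0.00 bucket, copied from the histograms.
ham_counts = [0, 7, 7, 13, 54, 87, 217, 369, 798, 1367, 1949, 2029, 1861,
              1378, 916, 733, 462, 219, 162, 138, 75, 61, 35, 30, 12, 9, 4,
              3, 0, 3, 2, 0, 0]
spam_counts = [0, 0, 0, 0, 0, 0, 0, 2, 0, 2, 1, 2, 2, 2, 4, 49, 27, 6, 16,
               19, 23, 33, 67, 92, 169, 232, 349, 502, 695, 714, 1149, 1572,
               2179, 2271, 1722, 816, 278, 5, 0, 0]


def split_at_cutoff(counts, cutoff=CUTOFF, width=BUCKET_WIDTH):
    """Return (items below cutoff, items at or above cutoff)."""
    first = round(cutoff / width)  # index of the first bucket at the cutoff
    return sum(counts[:first]), sum(counts[first:])


ham_below, ham_above = split_at_cutoff(ham_counts)
spam_below, spam_above = split_at_cutoff(spam_counts)
print("hams at or above cutoff (false positives):", ham_above)   # 63
print("spams below cutoff (false negatives):", spam_below)       # 255
```

Both tallies agree with the unique false positive and false negative totals in the run summary at the end of this message.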

How do I get timcv.py to dump the score of false negatives?

Here's a sample:

----

From: James <pmchuong0806@webmail.winona.edu>
To: Undisclosed.Recipients@anvil.murkworks.com
Subject: I've been so lazy
Sender: James <pmchuong0806@webmail.winona.edu>
Mime-Version: 1.0
Content-Type: text/plain; charset="iso-8859-1"
Date: Sun, 11 Aug 2002 23:04:49 -0500
X-Mailer: Microsoft Outlook Express 5.50.4133.2400

Hey how are you doing?

I've been so lazy this summer, I just got back from a fishing trip (3rd time this month).
 
And my girlfriend is wondering how I can afford to vacation all the time. "Where is all 
your money is coming from?  Are you with the mob?"  She asks.  "I'm not doing 
anything illegal", I said.  I just work about 30 minutes a week and within 60 days my 
check was $1,143!  

I told her it's this automated thing on the internet that works worldwide, I don't really 
know much about it, I just know when people sign up for free with no obligation, some 
how the system does all the work of signing people and paying me.  

It's duplication that makes it successful, they tell me.  And the best part it didn't cost me 
anything for me to be in the system.  Once there are people under me thru spill-overs, I 
upgraded.  I can't believe it cost me only $10 a month to earn $2,000 a week.

Well, if you're interested, let me know by replying to me with the subject: 
INTERESTED.  I check my messages 2 -3 times a week, so you should get something 
from me within 72 hours.  If you're not interested, ignore this message or reply with the 
subject: REMOVE to remove yourself.

Either way, I really don't care.  I know the system works because I get paid every week 
on time.  And that's good enough for me.


James
-----

    best discriminators:
        'only' 3048 0.63766
        'skip:r 10' 3057 0.765613
        'content-type:multipart/alternative' 3081 0.8886
        'offer' 3086 0.961576
        'what' 3253 0.366849
        'message' 3274 0.638722
        'now' 3370 0.676764
        'because' 3641 0.804518
        'but' 3711 0.315054
        'unsubscribe' 3737 0.960647
        'more' 3802 0.645036
        'receive' 3861 0.928872
        'new' 3958 0.603993
        'free' 4467 0.891097
        'header:Mime-Version:1' 4519 0.663075
        'url:gif' 4836 0.95745
        'one' 5274 0.697623
        'header:Reply-To:1' 5291 0.795601
        'url:murkworks' 5302 0.77701
        'get' 5611 0.638269
        'our' 5782 0.6929
        'please' 5845 0.708737
        'all' 6030 0.613003
        'here' 6347 0.879113
        'email' 6373 0.811837
        'click' 6628 0.943325
        'content-type:text/html' 6988 0.944668
        'header:Received:4' 8159 0.639928
        'content-type:text/plain' 8750 0.36458
        'x-mailer:none' 9011 0.631889


-> Training on Data/Ham/Set2-10 & Data/Spam/Set2-10 ... 11700 hams & 11700 spams
-> Predicting Data/Ham/Set1 & Data/Spam/Set1 ...
-> <stat> tested 1300 hams & 1300 spams against 11700 hams & 11700 spams
-> <stat> false positive %: 0.692307692308
-> <stat> false negative %: 1.69230769231
      0.692   1.692
-> <stat> 9 new false positives
-> <stat> 22 new false negatives
-> Training on Data/Ham/Set1 & Data/Spam/Set1 ... 1300 hams & 1300 spams
-> Forgetting Data/Ham/Set2 & Data/Spam/Set2 ... 1300 hams & 1300 spams
-> Predicting Data/Ham/Set2 & Data/Spam/Set2 ...
-> <stat> tested 1300 hams & 1300 spams against 11700 hams & 11700 spams
-> <stat> false positive %: 0.384615384615
-> <stat> false negative %: 2.46153846154
      0.385   2.462
-> <stat> 5 new false positives
-> <stat> 32 new false negatives
-> Training on Data/Ham/Set2 & Data/Spam/Set2 ... 1300 hams & 1300 spams
-> Forgetting Data/Ham/Set3 & Data/Spam/Set3 ... 1300 hams & 1300 spams
-> Predicting Data/Ham/Set3 & Data/Spam/Set3 ...
-> <stat> tested 1300 hams & 1300 spams against 11700 hams & 11700 spams
-> <stat> false positive %: 0.384615384615
-> <stat> false negative %: 2.15384615385
      0.385   2.154
-> <stat> 5 new false positives
-> <stat> 28 new false negatives
-> Training on Data/Ham/Set3 & Data/Spam/Set3 ... 1300 hams & 1300 spams
-> Forgetting Data/Ham/Set4 & Data/Spam/Set4 ... 1300 hams & 1300 spams
-> Predicting Data/Ham/Set4 & Data/Spam/Set4 ...
-> <stat> tested 1300 hams & 1300 spams against 11700 hams & 11700 spams
-> <stat> false positive %: 0.692307692308
-> <stat> false negative %: 1.92307692308
      0.692   1.923
-> <stat> 9 new false positives
-> <stat> 25 new false negatives
-> Training on Data/Ham/Set4 & Data/Spam/Set4 ... 1300 hams & 1300 spams
-> Forgetting Data/Ham/Set5 & Data/Spam/Set5 ... 1300 hams & 1300 spams
-> Predicting Data/Ham/Set5 & Data/Spam/Set5 ...
-> <stat> tested 1300 hams & 1300 spams against 11700 hams & 11700 spams
-> <stat> false positive %: 0.461538461538
-> <stat> false negative %: 1.84615384615
      0.462   1.846
-> <stat> 6 new false positives
-> <stat> 24 new false negatives
-> Training on Data/Ham/Set5 & Data/Spam/Set5 ... 1300 hams & 1300 spams
-> Forgetting Data/Ham/Set6 & Data/Spam/Set6 ... 1300 hams & 1300 spams
-> Predicting Data/Ham/Set6 & Data/Spam/Set6 ...
-> <stat> tested 1300 hams & 1300 spams against 11700 hams & 11700 spams
-> <stat> false positive %: 0.153846153846
-> <stat> false negative %: 1.69230769231
      0.154   1.692
-> <stat> 2 new false positives
-> <stat> 22 new false negatives
-> Training on Data/Ham/Set6 & Data/Spam/Set6 ... 1300 hams & 1300 spams
-> Forgetting Data/Ham/Set7 & Data/Spam/Set7 ... 1300 hams & 1300 spams
-> Predicting Data/Ham/Set7 & Data/Spam/Set7 ...
-> <stat> tested 1300 hams & 1300 spams against 11700 hams & 11700 spams
-> <stat> false positive %: 0.461538461538
-> <stat> false negative %: 1.53846153846
      0.462   1.538
-> <stat> 6 new false positives
-> <stat> 20 new false negatives
-> Training on Data/Ham/Set7 & Data/Spam/Set7 ... 1300 hams & 1300 spams
-> Forgetting Data/Ham/Set8 & Data/Spam/Set8 ... 1300 hams & 1300 spams
-> Predicting Data/Ham/Set8 & Data/Spam/Set8 ...
-> <stat> tested 1300 hams & 1300 spams against 11700 hams & 11700 spams
-> <stat> false positive %: 0.384615384615
-> <stat> false negative %: 2.07692307692
      0.385   2.077
-> <stat> 5 new false positives
-> <stat> 27 new false negatives
-> Training on Data/Ham/Set8 & Data/Spam/Set8 ... 1300 hams & 1300 spams
-> Forgetting Data/Ham/Set9 & Data/Spam/Set9 ... 1300 hams & 1300 spams
-> Predicting Data/Ham/Set9 & Data/Spam/Set9 ...
-> <stat> tested 1300 hams & 1300 spams against 11700 hams & 11700 spams
-> <stat> false positive %: 0.769230769231
-> <stat> false negative %: 2.07692307692
      0.769   2.077
-> <stat> 10 new false positives
-> <stat> 27 new false negatives
-> Training on Data/Ham/Set9 & Data/Spam/Set9 ... 1300 hams & 1300 spams
-> Forgetting Data/Ham/Set10 & Data/Spam/Set10 ... 1300 hams & 1300 spams
-> Predicting Data/Ham/Set10 & Data/Spam/Set10 ...
-> <stat> tested 1300 hams & 1300 spams against 11700 hams & 11700 spams
-> <stat> false positive %: 0.461538461538
-> <stat> false negative %: 2.15384615385
      0.462   2.154
-> <stat> 6 new false positives
-> <stat> 28 new false negatives
total unique false pos 63
total unique false neg 255
average fp % 0.484615384615
average fn % 1.96153846154
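
For anyone checking the arithmetic, the averages can be reproduced from the per-fold "new false positives" and "new false negatives" counts above. This is a sketch of the calculation only, not the test driver's own code; average_rate_percent is a hypothetical helper.

```python
FOLD_SIZE = 1300  # hams (and spams) predicted in each of the 10 runs

# "new false positives" / "new false negatives" per run, Set1 through Set10.
new_fp = [9, 5, 5, 9, 6, 2, 6, 5, 10, 6]
new_fn = [22, 32, 28, 25, 24, 22, 20, 27, 27, 28]


def average_rate_percent(counts, per_fold=FOLD_SIZE):
    """Mean of the per-run error rates, as a percentage."""
    rates = [100.0 * c / per_fold for c in counts]
    return sum(rates) / len(rates)


print("total fp:", sum(new_fp), "average fp %:", average_rate_percent(new_fp))
print("total fn:", sum(new_fn), "average fn %:", average_rate_percent(new_fn))
```

Because every run tests the same number of messages, the mean of the per-run rates equals the total count divided by 13000.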


Brad Clements,                bkc@murkworks.com   (315)268-1000
http://www.murkworks.com                          (315)268-9812 Fax
AOL-IM: BKClements