[Spambayes] Large false negative ...
Tim Peters
tim.one@comcast.net
Sat, 14 Sep 2002 03:04:51 -0400
[Skip Montanaro]
> I see extremely large false negative percentages. I didn't say anything
> earlier because I had relatively small training set sizes before
> the advent of timcv.py (400/333 ham/spam per set). I tried a run this
> evening with timcv. The results were similar (fn down to about 15% from
> around 22%).
That's a huge decrease, and the primary effect of timcv is to increase the
amount of data trained on per prediction run.
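To make the training-size point concrete, here is a minimal sketch of how n-fold cross-validation splits the sets. The names `cv_schedule` and `training_size` are illustrative only and are not taken from timcv.py; the 5-set layout is an assumption, and the 400/333 per-set sizes are the ones quoted above.

```python
def cv_schedule(nsets):
    """Yield (training set numbers, held-out set number) for n-fold c-v."""
    all_sets = list(range(1, nsets + 1))
    for test in all_sets:
        # Train on every set except the one being predicted.
        train = [s for s in all_sets if s != test]
        yield train, test


def training_size(per_set, nsets):
    """Messages trained on per prediction run: all sets but the held-out one."""
    return per_set * (nsets - 1)


if __name__ == "__main__":
    for train, test in cv_schedule(5):
        print("train on sets", train, "-> predict set", test)
    print("ham trained on per run: ", training_size(400, 5))
    print("spam trained on per run:", training_size(333, 5))
```

With 5 sets, each prediction run trains on four times the messages in a single set, which is where the extra training data comes from.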
> Here's the final summary chunk from the rates.py output:
>
> total unique false pos 0
> total unique false neg 262
> average fp % 0.0
> average fn % 15.7357357357
There's not enough here, Skip. rates.py prints more stuff than that, and
does so because it's important information. timcv also prints all the
options in effect right at the start, because that's important info too.
All that showing just these 4 numbers can tell me is that, ya, your fn rate
is supernaturally high and your fp rate supernaturally low. The list of
best discriminators (provided you have that enabled -- but I can't guess
that either <wink>) often has the best clues.
> Scanning through the reported false negatives, nothing much jumped out as
> unusual except the viruses. Figuring viruses were not spam and might be
> throwing things off, I went through my spam collection and deleted all
> the obvious viruses then rebalanced the spam sets (leaving 328 per set)
> and tried again:
>
> total unique false pos 0
> total unique false neg 249
> average fp % 0.0
> average fn % 15.1829268293
>
> I'm headed in the right direction,
It would be easier to have confidence in that had you run cmp.py against the
two summary files and posted the comparison. You're effectively revealing
two numbers per test run, and that's all ("total unique" is close to 100%
correlated with "average" in timcv output, so half the numbers I'm seeing
here are redundant).
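The redundancy is easy to check by hand. The sketch below assumes 5 sets, that each spam is scored exactly once across the folds, and that no message is counted twice; `fn_percent` is a made-up helper name, not part of rates.py.

```python
def fn_percent(unique_fn, spam_per_set, nsets):
    """False negatives as a percentage of all spam scored across the folds."""
    return 100.0 * unique_fn / (spam_per_set * nsets)


# Figures from the two quoted summaries: 333 spam per set, then 328 per set.
before = fn_percent(262, 333, 5)
after = fn_percent(249, 328, 5)

if __name__ == "__main__":
    print("before: %.10f" % before)
    print("after:  %.10f" % after)
    print("change: %.2f percentage points" % (before - after))
```

Under those assumptions the "average fn %" lines follow directly from the "total unique false neg" lines, so each run really does reveal only two independent numbers.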
> but am nowhere close to the sorts of results Tim and others have been
> getting.
Note that I'm using about 45x more training data.
> I'd be happy with 3-4% fn.
Try using 50x more data <wink>.
> On a somewhat brighter note, I'm quite happy with the fp percentage...
You shouldn't be, though -- the extreme imbalance in rates is as suspicious
as the absolute magnitude of the fn rate. It's as if all your ham have
something trivial in common that the classifier is latching onto (wouldn't
be the first time someone got tripped up by this!), and that a fair amount
of your spam also has that. Looking at the best discriminators may reveal
something of this nature.
> Can someone with a larger collection of ham 'n spam try running
> rebal.py to get 400 ham per set and 328 spam per set, then try
> "timcv -n5" and let me know what the overall fn percentage is?
Is it the case that you're only running 5-fold c-v?
> Assuming Data is a subdirectory of the current directory and
> Data/{Ham,Spam}/reservoir are your two reservoirs, you'd execute:
>
> rebal.py -n 400 -r Data/Ham/reservoir -s Data/Ham/Set -Q
> rebal.py -n 328 -r Data/Spam/reservoir -s Data/Spam/Set -Q
>
> Files should be migrated to your reservoirs. The -Q flag just shuts up
> rebal.py. After your timcv run you can run rebal.py again with
> different -n values to restore the numbers of ham and spam you had
> before.
rebal was very helpful in rearranging my directories, and thanks for
generalizing it! For those with a larger corpus, I suggest it's easier to
fiddle MsgStream.produce to pick smaller subsets at random; e.g.,
    def produce(self):
        import random
        # Keep 328 messages per spam directory, 400 per ham directory.
        keep = 'Spam' in self.directories[0] and 328 or 400
        for directory in self.directories:
            all = os.listdir(directory)
            random.seed(hash(max(all)))    # reproducible across calls
            random.shuffle(all)
            for fname in all[:keep]:
                yield Msg(directory, fname)
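For anyone trying the same trick on a current Python, here is a standalone sketch of the subsetting step. Two things differ from the snippet above and are my assumptions, not part of the original: Python 3 randomizes string hashes per process by default, so the seed comes from `zlib.crc32` instead of `hash()`; and the names are sorted first, because `os.listdir` makes no promise about order. `pick_subset` is a hypothetical helper name.

```python
import random
import zlib


def pick_subset(fnames, keep):
    """Return a pseudo-random subset of `keep` names, the same on every call."""
    names = sorted(fnames)  # make the result independent of listing order
    if not names:
        return []
    # Seed from the largest filename so the choice depends only on the
    # directory's contents, not on the process or the time of day.
    seed = zlib.crc32(names[-1].encode("utf-8"))
    rng = random.Random(seed)  # private generator; global state is untouched
    rng.shuffle(names)
    return names[:keep]


if __name__ == "__main__":
    print(pick_subset(["0003.txt", "0001.txt", "0004.txt", "0002.txt"], 2))
```

Using a private `random.Random` instance also avoids reseeding the module-level generator as a side effect.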
Here are the options I used:
"""
[TestDriver]
save_trained_pickles = False
show_histograms = True
show_ham_lo = 1.0
show_best_discriminators = 50
show_spam_lo = 1.0
show_ham_hi = 0.0
show_false_positives = True
pickle_basename = class
show_false_negatives = True
nbuckets = 40
show_charlimit = 100000
show_spam_hi = 0.0
[Classifier]
spambias = 1.0
min_spamprob = 0.01
unknown_spamprob = 0.5
hambias = 2.0
max_discriminators = 16
max_spamprob = 0.99
[Tokenizer]
safe_headers = abuse-reports-to
    date
    errors-to
    from
    importance
    in-reply-to
    message-id
    mime-version
    organization
    received
    reply-to
    return-path
    subject
    to
    user-agent
    x-abuse-info
    x-complaints-to
    x-face
mine_received_headers = False
retain_pure_html_tags = False
count_all_header_lines = False
"""
Here's the summary file:
"""
-> Training on Data/Ham/Set2-5 & Data/Spam/Set2-5 ... 1600 hams & 1312 spams
-> Predicting Data/Ham/Set1 & Data/Spam/Set1 ...
-> <stat> tested 400 hams & 328 spams against 1600 hams & 1312 spams
-> <stat> false positive %: 0.0
-> <stat> false negative %: 0.30487804878
0.000 0.305
-> <stat> 0 new false positives
-> <stat> 1 new false negatives
-> Training on Data/Ham/Set1 & Data/Spam/Set1 ... 400 hams & 328 spams
-> Forgetting Data/Ham/Set2 & Data/Spam/Set2 ... 400 hams & 328 spams
-> Predicting Data/Ham/Set2 & Data/Spam/Set2 ...
-> <stat> tested 400 hams & 328 spams against 1600 hams & 1312 spams
-> <stat> false positive %: 0.0
-> <stat> false negative %: 0.0
0.000 0.000
-> <stat> 0 new false positives
-> <stat> 0 new false negatives
-> Training on Data/Ham/Set2 & Data/Spam/Set2 ... 400 hams & 328 spams
-> Forgetting Data/Ham/Set3 & Data/Spam/Set3 ... 400 hams & 328 spams
-> Predicting Data/Ham/Set3 & Data/Spam/Set3 ...
-> <stat> tested 400 hams & 328 spams against 1600 hams & 1312 spams
-> <stat> false positive %: 0.5
-> <stat> false negative %: 0.609756097561
0.500 0.610
-> <stat> 2 new false positives
-> <stat> 2 new false negatives
-> Training on Data/Ham/Set3 & Data/Spam/Set3 ... 400 hams & 328 spams
-> Forgetting Data/Ham/Set4 & Data/Spam/Set4 ... 400 hams & 328 spams
-> Predicting Data/Ham/Set4 & Data/Spam/Set4 ...
-> <stat> tested 400 hams & 328 spams against 1600 hams & 1312 spams
-> <stat> false positive %: 0.0
-> <stat> false negative %: 0.0
0.000 0.000
-> <stat> 0 new false positives
-> <stat> 0 new false negatives
-> Training on Data/Ham/Set4 & Data/Spam/Set4 ... 400 hams & 328 spams
-> Forgetting Data/Ham/Set5 & Data/Spam/Set5 ... 400 hams & 328 spams
-> Predicting Data/Ham/Set5 & Data/Spam/Set5 ...
-> <stat> tested 400 hams & 328 spams against 1600 hams & 1312 spams
-> <stat> false positive %: 0.25
-> <stat> false negative %: 0.0
0.250 0.000
-> <stat> 1 new false positives
-> <stat> 0 new false negatives
total unique false pos 3
total unique false neg 3
average fp % 0.15
average fn % 0.182926829268
"""
It's clearly very much better than you're seeing.
Here are the score distributions:
Ham distribution for all runs:
* = 34 items
0.00 1994 ***********************************************************
2.50 0
5.00 0
7.50 0
10.00 0
12.50 0
15.00 0
17.50 0
20.00 1 *
22.50 0
25.00 0
27.50 0
30.00 0
32.50 0
35.00 0
37.50 0
40.00 1 *
42.50 0
45.00 0
47.50 0
50.00 0
52.50 0
55.00 1 *
57.50 0
60.00 0
62.50 0
65.00 0
67.50 0
70.00 0
72.50 0
75.00 0
77.50 0
80.00 0
82.50 0
85.00 0
87.50 0
90.00 0
92.50 0
95.00 0
97.50 3 *
Spam distribution for all runs:
* = 28 items
0.00 1 *
2.50 0
5.00 0
7.50 0
10.00 0
12.50 0
15.00 0
17.50 0
20.00 0
22.50 0
25.00 0
27.50 0
30.00 0
32.50 0
35.00 1 *
37.50 0
40.00 0
42.50 0
45.00 0
47.50 0
50.00 0
52.50 0
55.00 0
57.50 0
60.00 0
62.50 0
65.00 0
67.50 0
70.00 0
72.50 0
75.00 0
77.50 1 *
80.00 0
82.50 0
85.00 0
87.50 0
90.00 0
92.50 1 *
95.00 0
97.50 1636 ***********************************************************
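For reading the two histograms: with nbuckets = 40, each bucket covers 2.5 points of score, and the label is the bucket's lower edge. The sketch below shows one way such a display could be computed. It is not the test driver's actual code; the function names are invented, and the 59-star maximum width is an assumption picked because it reproduces the "* = 34 items" and "* = 28 items" legends above.

```python
NBUCKETS = 40
MAX_STARS = 59  # assumption: chosen so the legends above come out right


def bucket_index(score, nbuckets=NBUCKETS):
    """Map a score in [0.0, 1.0] to a bucket; 1.0 lands in the last bucket."""
    return min(int(score * nbuckets), nbuckets - 1)


def bucket_label(index, nbuckets=NBUCKETS):
    """Lower edge of the bucket, as a percentage."""
    return 100.0 * index / nbuckets


def items_per_star(largest_count, max_stars=MAX_STARS):
    """Smallest unit that lets the largest bucket fit in max_stars stars."""
    return max(1, (largest_count + max_stars - 1) // max_stars)


def stars(count, unit):
    """Round up, so any non-empty bucket shows at least one star."""
    return "*" * ((count + unit - 1) // unit)


if __name__ == "__main__":
    unit = items_per_star(1994)
    print("* =", unit, "items")
    print("%5.2f %4d %s" % (bucket_label(bucket_index(0.0)), 1994, stars(1994, unit)))
    print("%5.2f %4d %s" % (bucket_label(bucket_index(1.0)), 3, stars(3, unit)))
```

Rounding up is why a bucket holding a single message still gets a star, even when one star stands for dozens of items.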
Here are the best 20 discriminators from the last run:
'url:python' 188 0.01
'content-type:text/html' 193 0.97645
'module' 194 0.01
'header:Return-Path:2' 211 0.99
'unsubscribe' 222 0.99
'def' 235 0.01
'wrote' 238 0.0120086
'import' 255 0.01
'header:Received:7' 267 0.99
'header:In-Reply-To:1' 292 0.01
'url:gif' 304 0.99
'url:remove' 315 0.99
'header:Received:8' 333 0.99
'subject:Python' 437 0.01
'header:Errors-To:1' 454 0.0201643
'header:User-Agent:1' 471 0.01
'header:Organization:1' 634 0.0120482
'wrote:' 864 0.01
'header:X-Complaints-To:1' 885 0.01
'python' 1000 0.01
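The last column appears to be each token's spam probability, and most of them sit at the 0.01 and 0.99 limits set in the [Classifier] options. The sketch below shows how those options could interact, assuming a Graham-style scheme: a per-token estimate with ham counts weighted by hambias, clamped to [min_spamprob, max_spamprob], then a combined score over the max_discriminators most extreme tokens. It is an illustration under that assumption, not a copy of the classifier's code, and the function names are invented.

```python
HAMBIAS = 2.0
SPAMBIAS = 1.0
MIN_SPAMPROB = 0.01
MAX_SPAMPROB = 0.99
UNKNOWN_SPAMPROB = 0.5
MAX_DISCRIMINATORS = 16


def token_spamprob(hamcount, spamcount, nham, nspam):
    """Per-token spam probability from training counts (nham, nspam > 0)."""
    if hamcount == 0 and spamcount == 0:
        return UNKNOWN_SPAMPROB  # token never seen in training
    hamratio = min(1.0, HAMBIAS * hamcount / nham)
    spamratio = min(1.0, SPAMBIAS * spamcount / nspam)
    prob = spamratio / (hamratio + spamratio)
    # Clamp so no single token is treated as absolute proof either way.
    return max(MIN_SPAMPROB, min(MAX_SPAMPROB, prob))


def combine(probs, max_discriminators=MAX_DISCRIMINATORS):
    """Combine the token probabilities farthest from 0.5 into one score."""
    strongest = sorted(probs, key=lambda p: abs(p - 0.5), reverse=True)
    prod_spam = 1.0
    prod_ham = 1.0
    for p in strongest[:max_discriminators]:
        prod_spam *= p
        prod_ham *= 1.0 - p
    return prod_spam / (prod_spam + prod_ham)


if __name__ == "__main__":
    # A token seen in 10 of 1312 spam and in no ham hits the upper clamp.
    print(token_spamprob(0, 10, 1600, 1312))
    # A handful of clamped tokens pushes the score close to 0 or 1.
    print(combine([0.99, 0.99, 0.99, 0.01]))
```

If the scheme really works this way, a few clamped tokens are enough to drive a message's score almost all the way to 0 or 1, which would fit the score distributions above, and it would also explain how one trivial token shared by all of a corpus's ham could swamp everything else.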