[Spambayes] Large false negative ...
Skip Montanaro
skip@pobox.com
Fri, 13 Sep 2002 23:07:24 -0500
I see extremely large false negative percentages. I didn't say anything
earlier because, before the advent of timcv.py, my training sets were
relatively small (400/333 ham/spam per set). I tried a run this evening with
timcv.py. The results were similar (fn down to about 15% from around 22%).
Here's the final summary chunk from the rates.py output:
    total unique false pos 0
    total unique false neg 262
    average fp % 0.0
    average fn % 15.7357357357
Scanning through the reported false negatives, nothing much jumped out as
unusual except the viruses. Figuring viruses are not spam and might be
throwing things off, I went through my spam collection, deleted all the
obvious viruses, rebalanced the spam sets (leaving 328 per set), and
tried again:
    total unique false pos 0
    total unique false neg 249
    average fp % 0.0
    average fn % 15.1829268293
I'm headed in the right direction, but am nowhere close to the sorts of
results Tim and others have been getting. I'd be happy with 3-4% fn.
On a somewhat brighter note, I'm quite happy with the fp percentage...
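For reference, the two fn averages above are consistent with simply dividing the total false negatives by the total number of spam scored. Here is a minimal sketch of that arithmetic. It assumes five sets (as in the "timcv -n5" run) with the same number of spam in each, so that every spam is scored exactly once and the average of the per-set rates equals the overall rate. The function name fn_rate is made up for illustration and is not part of rates.py.

```python
def fn_rate(total_fn, spam_per_set, num_sets):
    """Percentage of spam misclassified as ham over a whole cross-validation run.

    Assumes every set holds the same number of spam and each spam is
    scored exactly once, so the overall rate equals the per-set average.
    """
    total_spam = spam_per_set * num_sets
    return 100.0 * total_fn / total_spam


# First run: 262 false negatives, 333 spam per set, 5 sets (assumed).
print(fn_rate(262, 333, 5))
# Second run, after removing viruses: 249 false negatives, 328 spam per set.
print(fn_rate(249, 328, 5))
```

Both values come out at roughly 15.7 and 15.2, matching the "average fn %" lines, so removing the viruses moved the rate by only about half a percentage point.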
Can someone with a larger collection of ham 'n spam try running rebal.py to
get 400 ham per set and 328 spam per set, then try "timcv -n5" and let me
know what the overall fn percentage is? Assuming Data is a subdirectory of
the current directory and Data/{Ham,Spam}/reservoir are your two reservoirs,
you'd execute:
    rebal.py -n 400 -r Data/Ham/reservoir -s Data/Ham/Set -Q
    rebal.py -n 328 -r Data/Spam/reservoir -s Data/Spam/Set -Q
Surplus files should be migrated from your sets into your reservoirs. The -Q
flag just shuts up rebal.py. After your timcv run you can run rebal.py again
with different -n values to restore the numbers of ham and spam you had before.
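Since restoring the old balance means knowing what the -n values were, it may help to record the per-set counts before rebalancing. The following is a rough sketch, not part of the spambayes tools. It assumes the set directories are named by appending a number to the -s prefix (Data/Ham/Set1, Data/Ham/Set2, and so on), and count_per_set is a hypothetical helper name.

```python
import glob
import os


def count_per_set(set_prefix):
    """Map each directory named set_prefix + digits to its file count.

    Assumption: set directories look like Data/Ham/Set1, Data/Ham/Set2, ...
    """
    counts = {}
    for path in sorted(glob.glob(set_prefix + "*")):
        suffix = path[len(set_prefix):]
        # Skip anything that is not a numbered set directory.
        if os.path.isdir(path) and suffix.isdigit():
            counts[path] = sum(
                1 for name in os.listdir(path)
                if os.path.isfile(os.path.join(path, name))
            )
    return counts


if __name__ == "__main__":
    for prefix in ("Data/Ham/Set", "Data/Spam/Set"):
        for path, n in count_per_set(prefix).items():
            print(path, n)
```

Run it from the directory that contains Data, once before the rebal.py commands above and once after, to confirm the sets ended up at 400 and 328 and to note the counts you want to return to.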
Skip