[Spambayes] Large false negative ...

Skip Montanaro skip@pobox.com
Fri, 13 Sep 2002 23:07:24 -0500


I'm seeing extremely large false negative percentages.  I didn't say
anything earlier because, before the advent of timcv.py, my training sets
were relatively small (400/333 ham/spam per set).  I tried a run this
evening with timcv.py, and the results were similar (fn down to about 15%
from around 22%).  Here's the final summary chunk from the rates.py output:

    total unique false pos 0
    total unique false neg 262
    average fp % 0.0
    average fn % 15.7357357357

Scanning through the reported false negatives, nothing much jumped out as
unusual except the viruses.  Figuring viruses were not spam and might be
throwing things off, I went through my spam collection, deleted all the
obvious viruses, rebalanced the spam sets (leaving 328 per set), and
tried again:

    total unique false pos 0
    total unique false neg 249
    average fp % 0.0
    average fn % 15.1829268293
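
As a sanity check on the two summaries above, both fn percentages can be
reproduced from the false negative counts and the per-set spam sizes.  This
is only a sketch: it assumes five sets and that every spam is scored exactly
once over the whole run, and fn_percent is a hypothetical helper, not
something from rates.py.

```python
def fn_percent(false_negatives, spam_per_set, num_sets=5):
    """Overall false negative rate, in percent.

    Assumes num_sets equally sized spam sets, each message scored once.
    """
    total_spam = spam_per_set * num_sets
    return 100.0 * false_negatives / total_spam


if __name__ == "__main__":
    # Before removing viruses: 262 misses, 333 spam per set.
    print(fn_percent(262, 333))
    # After removing viruses: 249 misses, 328 spam per set.
    print(fn_percent(249, 328))
```

Under those assumptions the two calls agree with the reported averages, which
suggests the drop from 262 to 249 is mostly the viruses going away rather than
any change in how the remaining spam is scored.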

I'm headed in the right direction, but am nowhere close to the sorts of
results Tim and others have been getting.  I'd be happy with 3-4% fn.

On a somewhat brighter note, I'm quite happy with the fp percentage...

Can someone with a larger collection of ham 'n spam try running rebal.py to
get 400 ham per set and 328 spam per set, then try "timcv -n5" and let me
know what the overall fn percentage is?  Assuming Data is a subdirectory of
the current directory and Data/{Ham,Spam}/reservoir are your two reservoirs,
you'd execute:

    rebal.py -n 400 -r Data/Ham/reservoir -s Data/Ham/Set -Q
    rebal.py -n 328 -r Data/Spam/reservoir -s Data/Spam/Set -Q

Any surplus files should be migrated to your reservoirs.  The -Q flag just
shuts up rebal.py.  After your timcv run you can run rebal.py again with
different -n values to restore the numbers of ham and spam you had before.
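
To confirm the sets ended up at the requested sizes before starting a run,
something like the following could be used.  It is a rough sketch, not part
of Spambayes: count_per_set is a hypothetical helper, and it assumes the -s
prefix corresponds to directories named Data/Ham/Set1, Data/Ham/Set2, and so
on, with one message per file.

```python
import glob
import os


def count_per_set(prefix):
    """Map each directory whose path starts with prefix to its file count."""
    counts = {}
    for path in sorted(glob.glob(prefix + "*")):
        if os.path.isdir(path):
            counts[path] = sum(
                1
                for name in os.listdir(path)
                if os.path.isfile(os.path.join(path, name))
            )
    return counts


if __name__ == "__main__":
    # Assumed layout: Data/{Ham,Spam}/Set1 ... SetN under the current directory.
    for prefix in ("Data/Ham/Set", "Data/Spam/Set"):
        for path, n in count_per_set(prefix).items():
            print(path, n)
```

If the layout assumption holds, each ham set should report 400 and each spam
set 328 after the two rebal.py commands above.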

Skip