[Spambayes] spamprob combining

Tim Peters tim.one@comcast.net
Sat, 12 Oct 2002 02:27:29 -0400


OK!  Gary and I exchanged info offline, and I believe the implementation of
use_chi_squared_combining matches his intent for it.
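
For anyone who hasn't read classifier.py yet, here's the shape of the
thing -- a stripped-down sketch, not the real code (the real code guards
against float underflow via frexp and selects extreme-word clues first).
It assumes the word probs have already been nudged away from exactly 0.0
and 1.0, as the classifier arranges:

"""
from math import exp, log as ln

def chi2Q(x2, v):
    # prob(chisq >= x2 | v degrees of freedom), for even v.
    assert v % 2 == 0, "even degrees of freedom only"
    m = x2 / 2.0
    total = term = exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    # Roundoff can push the sum a hair beyond 1.0.
    return min(total, 1.0)

def chi_squared_combine(probs):
    # probs:  the word spamprob clues extracted from one message.
    n = len(probs)
    if n == 0:
        return 0.5
    # If the clues were random noise, -2 * sum(ln p) would follow a
    # chi-squared distribution with 2n degrees of freedom.  Measure
    # how spammy and how hammy the clues look against that null
    # hypothesis, independently.
    S = 1.0 - chi2Q(-2.0 * sum(ln(1.0 - p) for p in probs), 2 * n)
    H = 1.0 - chi2Q(-2.0 * sum(ln(p) for p in probs), 2 * n)
    # S near 1 means "looks like spam"; H near 1 means "looks like
    # ham".  When both or neither fire, the result lands near 0.5 --
    # the middle ground discussed below.
    return (S - H + 1.0) / 2.0
"""

Summing logs instead of multiplying raw probs sidesteps the underflow
dance the real code does; otherwise the logic is the same.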

> ...
> Example:  if we called everything from 50 thru 80 "the middle
> ground", ... in a manual-review system, this combines all the
> desirable properties:
>
> 1. Very little is kicked out for review.
>
> 2. There are high error rates among the msgs kicked out for review.
>
> 3. There are unmeasurably low error rates among the msgs not kicked
>    out for review.

On my full 20,000 ham + 14,000 spam test, and with spam_cutoff 0.70, this
got 3 FP and 11 FN in a 10-fold CV run, compared to 2 FP and 11 FN under the
all-default scheme with the very touchy spam_cutoff.  The middle ground is
the *interesting* thing, and it's like a laser beam here (yippee!).  In the
"50 thru 80" range guessed at above,

1. 12 of 20,000 hams lived there, 1 of the FPs among them (scoring 0.737).
   The other 2 FP scored 0.999999929221 (Nigerian scam quote) and
   0.972986477986 (lady with the short question and long obnoxious
   employer-generated SIG).  I don't believe any usable scheme will
   ever call those ham, though, or put them in a middle ground without
   greatly bloating the middle ground with correctly classified
   messages.

2. 14 of 14,000 spams lived there, including 8 (yowza!) of the 11 FN
   (with 3 scores a bit above 0.5, 1 near 0.56, 1 near 0.58, 1 near
   0.61, 1 near 0.63, and 1 near 0.68).  The 3 remaining spam scored
   below 0.50:

0.35983017036
    "Hello, my Name is BlackIntrepid"
    Except for a URL and an invitation to visit it, this could have
    been a poorly written c.l.py post explaining a bit about hackers
    to newbies (and if you don't think there are plenty of those in
    my ham, you don't read c.l.py <wink>).

0.39570232415
    The embarrassing "HOW TO BECOME A MILLIONAIRE IN WEEKS!!" spam,
    whose body consists of a uuencoded text file we throw away
    unlooked at.  (This is quite curable -- see the sketch after
    this list -- but I doubt it's worth the bother, at least until
    spammers take to putting everything in uuencoded text files!)

0.499567195859 (about as close to the "middle ground" cutoff as can be)
    A giant (> 20KB) base64-encoded plain text file.  I've never
    bothered to decode this to see what it says; like the others,
    though, it's been a persistent FN under all schemes.  Note that
    we do decode this; I've always assumed it's of the "long, chatty,
    just-folks" flavor of tech spam that's hard to catch; the list of
    clues contains "cookies", "editor", "ms-dos", "backslashes",
    "guis", "commands", "folder", "dumb", "(well,", "cursor",
    and "trick" (a spamprob 0.00183748 word!).


For my original purpose of looking at a scheme for c.l.py traffic, this has
become the clear leader among all schemes:  while it's more extreme than I
might like, it made very few errors, and a minuscule middle ground (less
than 0.08% of all msgs) contains 64+% of all errors.  3 FN and 2 FP would
survive, but I don't expect that any usable scheme could do better on this
data.  Note that Graham combining was also very extreme, but had *no* usable
middle ground on this data:  all mistakes had scores of almost exactly 0.0
or almost exactly 1.0 (and there were more mistakes).
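
To see where those percentages come from, it's just arithmetic on the
counts reported above:

"""
total_msgs    = 20000 + 14000    # all hams + spams tested
middle_msgs   = 12 + 14          # msgs scoring in 0.50 thru 0.80
total_errors  = 3 + 11           # FP + FN overall
middle_errors = 1 + 8            # FP + FN inside the middle ground

print(middle_msgs / total_msgs)       # 0.00076...  < 0.08% of all msgs
print(middle_errors / total_errors)   # 0.6428...   = 64+% of all errors
"""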

How does it do for you?  An analysis like the above is what I'm looking for,
although it surely doesn't need to be so detailed.  Here's the .ini file I
used:

"""
[Classifier]
use_chi_squared_combining: True

[TestDriver]
spam_cutoff: 0.70

nbuckets: 200
best_cutoff_fp_weight: 10

show_false_positives: True
show_false_negatives: True
show_best_discriminators: 50
show_spam_lo: 0.40
show_spam_hi: 0.80
show_ham_lo: 0.40
show_ham_hi: 0.80
show_charlimit: 100000
"""

Your best spam_cutoff may be different, but the point of this exercise isn't
to find the best cutoff; it's to think about the middle ground.  Note that I
set

   show_{ham,spam}_{lo,hi}

to values such that I would see every ham and spam that lived in my presumed
middle ground of 0.50-0.80, plus down to 0.40 on the low end.  I also set
show_charlimit to a large value so that I'd see the full text of each such
msg.
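
One practical note, in case it saves someone a trip through the source:
the .ini isn't passed on a command line; Options.py checks the
BAYESCUSTOMIZE environment variable for extra option files.  A sketch,
assuming you saved the above as chi.ini (a made-up name) and run from
the directory containing Options.py:

"""
import os
# Must be set before Options is first imported.
os.environ["BAYESCUSTOMIZE"] = "chi.ini"

from Options import options  # merges the defaults with chi.ini
print(options.spam_cutoff)   # -> 0.7
print(options.nbuckets)      # -> 200
"""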

Heh:  My favorite:  Data/Ham/Set7/51781.txt got overall score 0.485+, close
to the middle ground cutoff.  It's a msg I posted 2 years ago to the day (12
Oct 2000), and consists almost entirely of a rather long transcript of part
of the infamous Chicago Seven trial:

    http://www.law.umkc.edu/faculty/projects/ftrials/Chicago7/chicago7.html

I learned two things from this <wink>:

1. There are so many unique lexical clues when I post a thing, I can
   get away with posting anything.

2. "tyranny" is a spam clue, but "nazi" a ham clue:

      prob('tyranny') = 0.850877
      prob('nazi')    = 0.282714

leaving-lexical-clues-amid-faux-intimations-of-profundity-ly y'rs  - tim