This sounds like it's working out pretty well! If we get to the point that it becomes the accepted technique for spambayes, I'll add it to my online essay.

NOTE: As we've discussed ad nauseam, this multiplicative thing is one-sided in its sensitivity, which is why we end up having to do something like S/(S+H), where S is based on (1-p) calcs for combining the p's and H is based on p calcs.

There ARE meta-analytical ways of combining the p-values which are equally sensitive on both sides... but they are a TAD less sensitive overall than the chi-square thing. And frankly, the S/(S+H)-style trick may take away a lot of that super-strength super-sensitivity anyway -- maybe even all of the advantage over other methods (I just don't know without directly testing it). So a two-sided combining approach may perform equally well for our practical purposes... there's no way of knowing without trying.

The advantage of such an approach would essentially be algorithmic elegance. No longer would we need that kludgy (P-Q)/(P+Q) or S/(S+H) stuff, which doesn't convert to a real probability. Instead, the combined P would be all we would need: combined P near 1 would be spammy, and combined P near 0 would be hammy. And P would be a REAL probability (against the null hypothesis of randomness). I wouldn't expect any performance ADVANTAGE from this other approach, but it WOULD be more elegant. (Note that all these approaches depend on one or another statistical function, as the current one depends on the inverse chi-square.)

If you are interested in going that way, let me know and I'll send info on how to do it. Maybe you'll have another beautifully simple algorithm up your sleeve to implement the necessary statistical function.

--Gary

--
Gary Robinson
CEO
Transpose, LLC
grobinson@transpose.com
207-942-3463
http://www.emergentmusic.com
http://radio.weblogs.com/0101454
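[Editor's note: for readers following along, the combining schemes under discussion can be sketched in a few lines of Python. This is a hedged illustration, not the spambayes source: `chi2Q` is the standard series expansion for the chi-squared survival function (the "inverse chi-square" function Gary mentions), `combine` is the one-sided S/(S+H) trick described above, and `stouffer_combine` is one concrete example (Stouffer's z-score method) of the symmetric "real probability" alternative he alludes to.]

```python
import math
from statistics import NormalDist

def chi2Q(x2, v):
    """P(chi-squared >= x2) with v degrees of freedom (v must be even),
    computed via the standard series expansion -- no external libraries."""
    assert v % 2 == 0
    m = x2 / 2.0
    term = total = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def combine(probs):
    """The one-sided S/(S+H) trick described above (a sketch, not the
    spambayes implementation).  Each prob must be strictly in (0, 1).
    By Fisher's method, -2*sum(ln p_i) is chi-squared distributed with
    2n degrees of freedom when the p_i are uniform (the null hypothesis)."""
    n = len(probs)
    # S: evidence of spamminess, from the (1-p) calcs
    S = 1.0 - chi2Q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    # H: evidence of hamminess, from the p calcs
    H = 1.0 - chi2Q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    return S / (S + H)

def stouffer_combine(probs):
    """One example of the two-sided alternative alluded to above
    (Stouffer's method): convert each p to a z-score, sum, renormalize.
    The result is a real probability against the null hypothesis of
    randomness -- near 1 is spammy, near 0 is hammy, no S/(S+H) needed."""
    nd = NormalDist()
    z = sum(nd.inv_cdf(p) for p in probs) / math.sqrt(len(probs))
    return nd.cdf(z)
```

Whether the symmetric method really matches the chi-square scheme's sensitivity on real mail is exactly the open question Gary raises; the sketch is only meant to make the algebra concrete.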
From: Tim Peters <tim.one@comcast.net>
Date: Sat, 12 Oct 2002 02:27:29 -0400
To: SpamBayes <spambayes@python.org>
Cc: Gary Robinson <grobinson@transpose.com>
Subject: RE: [Spambayes] spamprob combining
OK! Gary and I exchanged info offline, and I believe the implementation of use_chi_squared_combining matches his intent for it.
... Example: if we called everything from 0.50 thru 0.80 "the middle ground", ... in a manual-review system, this combines all the desirable properties:
1. Very little is kicked out for review.
2. There are high error rates among the msgs kicked out for review.
3. There are unmeasurably low error rates among the msgs not kicked out for review.
On my full 20,000 ham + 14,000 spam test, and with spam_cutoff 0.70, this got 3 FP and 11 FN in a 10-fold CV run, compared to 2 FP and 11 FN under the all-default scheme with the very touchy spam_cutoff. The middle ground is the *interesting* thing, and it's like a laser beam here (yippee!). In the "0.50 thru 0.80" range guessed at above,
1. 12 of 20,000 hams lived there, 1 of the FPs among them (scoring 0.737). The other 2 FP scored 0.999999929221 (Nigerian scam quote) and 0.972986477986 (lady with the short question and long obnoxious employer-generated SIG). I don't believe any usable scheme will ever call those ham, though, or put them in a middle ground without greatly bloating the middle ground with correctly classified messages.
2. 14 of 14,000 spams lived there, including 8 (yowza!) of the 11 FN (with 3 scores a bit above 0.5, 1 near 0.56, 1 near 0.58, 1 near 0.61, 1 near 0.63, and 1 near 0.68). The 3 remaining spam scored below 0.50:
0.35983017036 "Hello, my Name is BlackIntrepid" Except that it contained a URL and an invitation to visit it, this could have been a poorly written c.l.py post explaining a bit about hackers to newbies (and if you don't think there are plenty of those in my ham, you don't read c.l.py <wink>).
0.39570232415 The embarrassing "HOW TO BECOME A MILLIONAIRE IN WEEKS!!" spam, whose body consists of a uuencoded text file we throw away unlooked at. (This is quite curable, but I doubt it's worth the bother -- at least until spammers take to putting everything in uuencoded text files!)
0.499567195859 (about as close to "middle ground" cutoff as can be) A giant (> 20KB) base64-encoded plain text file. I've never bothered to decode this to see what it says; like the others, though, it's been a persistent FN under all schemes. Note that we do decode this; I've always assumed it's of the "long, chatty, just-folks" flavor of tech spam that's hard to catch; the list of clues contains "cookies", "editor", "ms-dos", "backslashes", "guis", "commands", "folder", "dumb", "(well,", "cursor", and "trick" (a spamprob 0.00183748 word!).
For my original purpose of looking at a scheme for c.l.py traffic, this has become the clear leader among all schemes: while it's more extreme than I might like, it made very few errors, and a minuscule middle ground (less than 0.08% of all msgs) contains 64+% of all errors. 3 FN would survive, and 2 FP, but I don't expect that any usable scheme could do better on this data. Note that Graham combining was also very extreme, but had *no* usable middle ground on this data: all mistakes had scores of almost exactly 0.0 or almost exactly 1.0 (and there were more mistakes).
How does it do for you? An analysis like the above is what I'm looking for, although it surely doesn't need to be so detailed. Here's the .ini file I used:
"""
[Classifier]
use_chi_squared_combining: True

[TestDriver]
spam_cutoff: 0.70
nbuckets: 200
best_cutoff_fp_weight: 10
show_false_positives: True
show_false_negatives: True
show_best_discriminators: 50
show_spam_lo: 0.40
show_spam_hi: 0.80
show_ham_lo: 0.40
show_ham_hi: 0.80
show_charlimit: 100000
"""
Your best spam_cutoff may be different, but the point to this exercise isn't to find the best cutoff, it's to think about the middle ground. Note that I set
show_{ham,spam}_{lo,hi}
to values such that I would see every ham and spam that lived in my presumed middle ground of 0.50-0.80, plus down to 0.40 on the low end. I also set show_charlimit to a large value so that I'd see the full text of each such msg.
Heh: My favorite: Data/Ham/Set7/51781.txt got overall score 0.485+, close to the middle ground cutoff. It's a msg I posted 2 years ago to the day (12 Sep 2000), and consists almost entirely of a rather long transcript of part of the infamous Chicago Seven trial:
http://www.law.umkc.edu/faculty/projects/ftrials/Chicago7/chicago7.html
I learned two things from this <wink>:
1. There are so many unique lexical clues when I post a thing, I can get away with posting anything.
2. "tyranny" is a spam clue, but "nazi" is a ham clue:

prob('tyranny') = 0.850877
prob('nazi') = 0.282714
leaving-lexical-clues-amid-faux-intimations-of-profundity-ly y'rs - tim