At present, it appears that S may perform better than relatively untweaked Graham (as I guess bogofilter is using), but about the same as the tweaked Graham that spambayes is using -- with the advantage of showing a spread of results.

As I see it there are two possible advantages to the spread:

1) emails in the middle could be classified as "you should manually check these", or

2) the user could adjust for the level of sensitivity he wants (similar to what happens now in Entourage, which is otherwise brain-dead as a spam filter).

(1) is attractive, but one thing worries me: it may give people a false sense of security -- they may think they only have to look at the middle ones, but occasionally even extreme cases of seeming to be spam might be legit. On the other hand, on those rare occasions, the sender can always try again if he doesn't get a response! ;)

Just a couple of thoughts.

--Gary

--
Gary Robinson
CEO
Transpose, LLC
grobinson@transpose.com
207-942-3463
http://www.emergentmusic.com
http://radio.weblogs.com/0101454
On 20020919 (Thu) at 1035:04 -0400, Gary Robinson wrote:
At present, it appears that S may perform better than relatively untweaked graham (as I guess bogofilter is using)
Yes, bogofilter 0.7 (haven't looked at the latest release yet) is using untweaked Graham.
1) emails in the middle could be classified as "you should manually check these", or
(1) is a attractive but one thing worries me: it may give people a false sense of security -- they may think they only have to look at the middle ones, but occasionally even extreme cases of seeming to be spam might be legit.
Yup. In the test I reported earlier, the one false positive had an S value of 0.93 or thereabouts. It was a message to a bogofilter mailing list and looked extremely innocuous except for a couple of artefacts in the training database (words that ought to have been seen in many messages but actually showed up predominantly in spam). Ironically, one of these was "fetchmail-5.9.14" -- I'd been adding a lot of spam to the corpus since that version came out :)

Shows how easy it is to bias this kind of classification; you really need to feed it similar numbers of similarly-dated messages in order to get good training, and shortcuts can be dangerous...

--
| G r e g  L o u i s          | gpg public key:      |
|   http://www.bgl.nu/~glouis |   finger greg@bgl.nu |
[Greg Louis, reports on an experiment using f(w) in a modified bogofilter]
... Yup. In the test I reported earlier, the one false positive had an S value of 0.93 or thereabouts. It was a message to a bogofilter mailing list and looked extremely innocuous except for a couple of artefacts in the training database (words that ought to have been seen in many messages but actually showed up predominantly in spam). Ironically, one of these was "fetchmail-5.9.14" -- I'd been adding a lot of spam to the corpus since that version came out :) Shows how easy it is to bias this kind of classification; you really need to feed it similar numbers of similarly-dated messages in order to get good training, and shortcuts can be dangerous...
They sure can be -- many of us here learned this the hard way too. It was so bad for me at first (mixed-source data) that I ignored headers entirely; but that forced us to work harder on other things, and that appears to have given us some huge wins.

One thing that really helped me: our database has an integer "kill count" attached to each word. Whenever scoring a new message, the words that survive to the end of the Graham scheme (the "most extreme" words in the message) each get their killcount bumped by 1. Whenever I've gotten great results for bogus reasons, looking at the words with the highest killcounts has instantly pinpointed the cause (for example, when I started tokenizing "To" headers, "bruceg" became one of the two most frequent killer clues in the whole database).

This kind of thing is easy for us to do so long as we're doing research; I imagine Eric might balk at the additional database burden in bogofilter, but he shouldn't <wink>. The killcount is much less revealing under the S ("f(w)") scheme, btw -- it tends then to reveal merely which words appear most often.
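A minimal sketch of the kill-count idea in Python (the record layout, word list, and function names here are invented for illustration; this is not spambayes' actual database code):

```python
# Hypothetical per-word records: a spamprob plus an integer kill count.
db = {
    "bruceg": {"spamprob": 0.01, "killcount": 0},
    "v1agra": {"spamprob": 0.99, "killcount": 0},
    "the":    {"spamprob": 0.51, "killcount": 0},
}

def extreme_words(tokens, db, n=15):
    # The Graham scheme keeps the n known words whose spamprobs lie
    # farthest from the neutral 0.5.
    known = [t for t in tokens if t in db]
    known.sort(key=lambda t: abs(db[t]["spamprob"] - 0.5), reverse=True)
    return known[:n]

def score_words(tokens, db, n=15):
    # Whenever a message is scored, bump the kill count of each word
    # that survives to the end (the "most extreme" discriminators).
    survivors = extreme_words(tokens, db, n)
    for t in survivors:
        db[t]["killcount"] += 1
    return survivors

score_words(["the", "bruceg", "v1agra"], db, n=2)

# After scoring many messages, the biggest "killers" pinpoint any
# accidental bias in the training data:
top_killers = sorted(db, key=lambda t: db[t]["killcount"], reverse=True)
```

Sorting by kill count after a test run is the whole trick: a word like "bruceg" floating to the top immediately flags a tokenization or corpus artefact.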
The attached sets up an experiment:

    create a vector of 50 "probabilities" at random, uniformly
        distributed in (0.0, 1.0)
    combine them using Paul Graham's scheme, and using Gary
        Robinson's scheme
    record the results
    repeat 5000 times

The results should look familiar for those playing this game from the start:

Result for random vectors of 50 probs, + 0 forced to 0.99

Graham combining 5000 items; mean 0.50; sdev 0.47
-> <stat> min 9.54792e-022; median 0.506715; max 1
* = 35 items
0.00 2051 ***********************************************************
0.05  100 ***
0.10   75 ***
0.15   63 **
0.20   44 **
0.25   35 *
0.30   40 **
0.35   34 *
0.40   30 *
0.45   25 *
0.50   34 *
0.55   32 *
0.60   31 *
0.65   24 *
0.70   39 **
0.75   43 **
0.80   56 **
0.85   55 **
0.90  108 ****
0.95 2081 ************************************************************

Robinson combining 5000 items; mean 0.50; sdev 0.04
-> <stat> min 0.350831; median 0.500083; max 0.649056
* = 34 items
0.00    0
0.05    0
0.10    0
0.15    0
0.20    0
0.25    0
0.30    0
0.35   20 *
0.40  450 **************
0.45 2027 ************************************************************
0.50 2019 ************************************************************
0.55  452 **************
0.60   32 *
0.65    0
0.70    0
0.75    0
0.80    0
0.85    0
0.90    0
0.95    0

IOW, Paul's scheme is almost always "certain" given 50 discriminators, even in the face of random input. Gary's is never "certain" then.

OTOH, do the experiment all over again, but attach one prob of 0.99 to each random vector of 50 probs.
The probs are now systematically biased:

Result for random vectors of 50 probs, + 1 forced to 0.99

Graham combining 5000 items; mean 0.65; sdev 0.45
-> <stat> min 8.36115e-021; median 0.992403; max 1
* = 47 items
0.00 1353 *****************************
0.05   92 **
0.10   50 **
0.15   42 *
0.20   40 *
0.25   35 *
0.30   26 *
0.35   31 *
0.40   32 *
0.45   31 *
0.50   23 *
0.55   29 *
0.60   30 *
0.65   31 *
0.70   45 *
0.75   33 *
0.80   58 **
0.85   84 **
0.90  113 ***
0.95 2822 *************************************************************

Robinson combining 5000 items; mean 0.51; sdev 0.04
-> <stat> min 0.377845; median 0.513446; max 0.637992
* = 42 items
0.00    0
0.05    0
0.10    0
0.15    0
0.20    0
0.25    0
0.30    0
0.35    2 *
0.40  181 *****
0.45 1549 *************************************
0.50 2527 *************************************************************
0.55  698 *****************
0.60   43 **
0.65    0
0.70    0
0.75    0
0.80    0
0.85    0
0.90    0
0.95    0

There's a dramatic difference in the Paul results, while the Gary results move subtly (in comparison).
If we force 10 additional .99 spamprobs, the differences are night and day:

Result for random vectors of 50 probs, + 10 forced to 0.99

Graham combining 5000 items; mean 1.00; sdev 0.01
-> <stat> min 0.213529; median 1; max 1
* = 82 items
0.00    0
0.05    0
0.10    0
0.15    0
0.20    1 *
0.25    0
0.30    1 *
0.35    0
0.40    0
0.45    0
0.50    0
0.55    0
0.60    0
0.65    0
0.70    0
0.75    0
0.80    0
0.85    0
0.90    0
0.95 4998 *************************************************************

Robinson combining 5000 items; mean 0.59; sdev 0.03
-> <stat> min 0.49794; median 0.58555; max 0.694905
* = 51 items
0.00    0
0.05    0
0.10    0
0.15    0
0.20    0
0.25    0
0.30    0
0.35    0
0.40    0
0.45    2 *
0.50  412 *********
0.55 3068 *************************************************************
0.60 1447 *****************************
0.65   71 **
0.70    0
0.75    0
0.80    0
0.85    0
0.90    0
0.95    0

It's hard to know what to make of this, especially in light of the claim that Gary-combining has been proven to be the most sensitive possible test for rejecting the hypothesis that a collection of probs is uniformly distributed. At least in this test, Paul-combining seemed far more sensitive (even when the data is random <wink>).

Intuitively, it *seems* like it would be good to get something not so insanely sensitive to random input as Paul-combining, but more sensitive to overwhelming amounts of evidence than Gary-combining.
Even forcing 50 spamprobs of 0.99, the latter only moves up to an average of 0.7:

Result for random vectors of 50 probs, + 50 forced to 0.99

Graham combining 5000 items; mean 1.00; sdev 0.00
-> <stat> min 1; median 1; max 1
* = 82 items
0.00    0
0.05    0
0.10    0
0.15    0
0.20    0
0.25    0
0.30    0
0.35    0
0.40    0
0.45    0
0.50    0
0.55    0
0.60    0
0.65    0
0.70    0
0.75    0
0.80    0
0.85    0
0.90    0
0.95 5000 *************************************************************

Robinson combining 5000 items; mean 0.70; sdev 0.02
-> <stat> min 0.628976; median 0.704543; max 0.810235
* = 45 items
0.00    0
0.05    0
0.10    0
0.15    0
0.20    0
0.25    0
0.30    0
0.35    0
0.40    0
0.45    0
0.50    0
0.55    0
0.60   40 *
0.65 2070 **********************************************
0.70 2743 *************************************************************
0.75  146 ****
0.80    1 *
0.85    0
0.90    0
0.95    0
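Tim's attached combine.py isn't preserved in the archive. The following is a reconstruction of the experiment as described, not the original script; the function names are mine, and the combining formulas are Graham's product rule and the P/Q/S calculation quoted later in the thread:

```python
import random

def graham_combine(probs):
    # Paul Graham's combining: p/(p + q), where p is the product of
    # the probabilities and q the product of their complements.
    p = q = 1.0
    for x in probs:
        p *= x
        q *= 1.0 - x
    return p / (p + q)

def robinson_combine(probs):
    # Gary Robinson's combining: nth-root (geometric-mean style)
    # P and Q, folded into S = (1 + (P - Q)/(P + Q)) / 2.
    n = len(probs)
    p = q = 1.0
    for x in probs:
        p *= x
        q *= 1.0 - x
    P = 1.0 - q ** (1.0 / n)
    Q = 1.0 - p ** (1.0 / n)
    return (1.0 + (P - Q) / (P + Q)) / 2.0

# Random vectors of 50 probs, plus 0/1/10/50 probs forced to 0.99,
# combined 5000 times each; print the mean score of both schemes.
random.seed(42)
for forced in (0, 1, 10, 50):
    g = r = 0.0
    for _ in range(5000):
        probs = [random.random() for _ in range(50)] + [0.99] * forced
        g += graham_combine(probs)
        r += robinson_combine(probs)
    print(forced, g / 5000.0, r / 5000.0)
```

Note that with no forced probs the Graham mean hugs 0.5 only because the extremes at 0 and 1 balance out; the Robinson scores genuinely cluster near 0.5, matching the histograms above.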
So then, "Tim Peters" <tim@zope.com> is all like:
The attached sets up an experiment:
create a vector of 50 "probabilities" at random, uniformly distributed in (0.0, 1.0)
combine them using Paul Graham's scheme, and using Gary Robinson's scheme
record the results
repeat 5000 times
The results should look familiar for those playing this game from the start:
Heh, I got an exception:

Traceback (most recent call last):
  File "combine.py", line 56, in ?
    h1.display()
  File "Histogram.py", line 116, in display
    raise ValueError("nbuckets %g > 0 required" % nbuckets)
TypeError: float argument required

I patched Histogram.py to do what I think you meant (also ITYM "buckets", not "buckts"):

@@ -111,6 +112,8 @@
     # buckts to a list of nbuckets counts, but only if at least one
     # data point is in the collection.
     def display(self, nbuckets=None, WIDTH=61):
+        if nbuckets is None:
+            nbuckets = self.nbuckets
         if nbuckets <= 0:
             raise ValueError("nbuckets %g > 0 required" % nbuckets)
         self.compute_stats()

Submitted for your approval,

Neale
On 08-Oct-02 Tim Peters wrote:
It's hard to know what to make of this, especially in light of the claim that Gary-combining has been proven to be the most sensitive possible test for rejecting the hypothesis that a collection of probs is uniformly distributed. At least in this test, Paul-combining seemed far more sensitive (even when the data is random <wink>).
Intuitively, it *seems* like it would be good to get something not so insanely sensitive to random input as Paul-combining, but more sensitive to overwhelming amounts of evidence than Gary-combining. Even forcing 50 spamprobs of 0.99, the latter only moves up to an average of 0.7:
Since my last msg was incomprehensible, I'm just going to attach my code at the bottom and refer to it.

Graham's original score calculation -

    product/(product + inverseProduct)

- does give the kind of score distribution you described. If you substitute Gary Robinson's suggestion (see below - last few lines), the score distribution does spread out to the center a little bit.

You can get Robinson's scoring calculation (as below) to produce a normal distribution around the mean ham or spam score if you either:

a. Increase VECTOR_SIZE (max_discriminators??) - a value of around 100 seems to do pretty well

b. Instead of selecting the most extreme N word probabilities from the msg being tested, select the words randomly from the list of words in the msg (not shown in code below). You immediately (VECTOR_SIZE = 15) get a normal distribution around the means, but accuracy sucks until you select 75 to 100 words/msg randomly.

Neither (a) nor (b) works as well as the 15 most extreme words on my test data. Also, Robinson's calculation doesn't produce ham at 0.99 or spam at 0.01 - in fact the msgs that I had a hard time classifying manually are (mostly) the ones that fall near the cutoff.

Note also that the code below will produce an unpredictable score if the msg contains only 30 .01 words and 30 .99 words. It depends on how pairs.sort(...) handles ties. Making the limits asymmetrical (eg .989 and .01 instead of .99/.01) doesn't seem to work very well.

The other thing that helps make the scores extreme in actual use is that the distribution of word probabilities is extreme. For my corpora using the code below I get 169378 unique tokens (from 24000 msgs, 50% spam):

Probability     Number of Tokens   % of Total
[0.00, 0.01)    46329              27.4%  (never in spam)
[0.99, 1.00)    104367             61.7%  (never in ham)
                                   -----
                                   89.1%
From looking at failures (and assuming passes behave similarly) the 10.9% (~17000 tokens) in between 0.01 and 0.99 still do a lot of the work, which makes sense, since those are the most commonly used words.
My experience has been that the tail tips of the score distribution maintain about the same distance from the mean score no matter what you do. If you improve the shape of the distribution (make it look more normal), you move the tails about the same distance as the distribution has spread out, and the ham and spam tails overlap more and more, increasing the fp/fn rates. The little testing I did on Spambayes (last week's CVS) seemed to show the same effect.

For the code below, if I train on 8000 msgs (50% spam) and then test 200, retrain on those 200, and repeat for 16000 msgs, I get 4 fns (3 are identical msgs from the same sender with different dates, all are Klez msgs) and 1 fp (an ISP msg "Routine Service Maintenance"), which are fn and fp rates of 0.05% and 0.01%. The failures all scored in the range [0.495, 0.511] (cutoff at 0.50).

I ran the SA Corpus today also and don't get any failures if I train on 8K of my msgs and 50/100 of their msgs (worse results under other conditions), but the sample sizes there are too small to do an adequate training sample and have enough test data to have confidence in the results. I can post those results if anyone is interested.

Graham's method was basically designed to produce extreme scores, and the distribution of words in the data seems to reinforce that. If it's of any use to anybody (it's certainly beyond me), both the distribution of msg scores and distribution of word probabilities look like exponential or Weibull distributions. (They're "bathtub" curves, if anyone is familiar with reliability statistics.)

This is all based on my data, which is not the same as your data. YMMV.
Jim

# classes posted to c.l.p by Erik Max Francis
# algorithm from Paul Graham ("A Plan for Spam")

# was TOKEN_RE = re.compile(r"[a-zA-Z0-9'$_-]+")
# changed to catch Asian charsets
TOKEN_RE = re.compile(r"[\w'$_-]+", re.U)

FREQUENCY_THRESHHOLD = 1  # was 5
GOOD_BIAS = 2.0
BAD_BIAS = 1.0
# changed to improve distribution 'width' because
# of smaller token count in training data
GOOD_PROB = 0.0001  # was 0.01
BAD_PROB = 0.9999   # was 0.99
VECTOR_SIZE = 15
UNKNOWN_PROB = 0.5  # was 0.4 or 0.2

# remove mixed alphanumerics or strictly numeric:
# eg: HM6116, 555N, 1234 (also Windows98, 133t, h4X0r)
pn1_re = re.compile(r"[a-zA-Z]+[0-9]+")
pn2_re = re.compile(r"[0-9]+[a-zA-Z]+")
num_re = re.compile(r"^[0-9]+")


class Corpus(dict):
    # instantiate one training Corpus for spam, one for ham,
    # and then one Corpus for each test msg as msgs are tested
    # (the msg Corpus instance is destroyed after testing the msg)
    def __init__(self, data=None):
        dict.__init__(self)
        self.count = 0
        if data is not None:
            self.process(data)

    # process is used to extract tokens from msg, either in building
    # the training sample or when testing a msg (can process entire
    # msg or one part of msg at a time); 'data' is a string
    def process(self, data):
        tokens = TOKEN_RE.findall(str(data))
        if not len(tokens):
            return
        # added the first 'if' in the loop to reduce
        # total # of tokens by >75%
        deletes = 0
        for token in tokens:
            if (len(token) > 20)\
               or (pn1_re.search(token) != None)\
               or (pn2_re.search(token) != None)\
               or (num_re.search(token) != None):
                deletes += 1
                continue
            if self.has_key(token):
                self[token] += 1
            else:
                self[token] = 1
        # count tokens, not msgs
        self.count += len(tokens) - deletes


class Database(dict):
    def __init__(self, good, bad):
        dict.__init__(self)
        self.build(good, bad)

    # 'build' constructs the dict of token: probability; run once
    # after training from the ham/spam Corpus instances; the ham/spam
    # Corpus instances can be destroyed (after saving?) after 'build'
    # is run
    def build(self, good, bad):
        ngood = good.count
        nbad = bad.count
        # print ngood, nbad, float(nbad)/float(ngood)
        for token in good.keys() + bad.keys():  # doubles up, but works
            if not self.has_key(token):
                g = GOOD_BIAS*good.get(token, 0)
                b = BAD_BIAS*bad.get(token, 0)
                if g + b >= FREQUENCY_THRESHHOLD:
                    # the 'min's are leftovers from counting
                    # msgs instead of tokens for ngood, nbad
                    goodMetric = min(1.0, g/ngood)
                    badMetric = min(1.0, b/nbad)
                    total = goodMetric + badMetric
                    prob = max(GOOD_PROB,
                               min(BAD_PROB, badMetric/total))
                    self[token] = prob

    def scan(self, corpus):
        pairs = [(token, self.get(token, UNKNOWN_PROB))
                 for token in corpus.keys()]
        pairs.sort(lambda x, y: cmp(abs(y[1] - 0.5), abs(x[1] - 0.5)))
        significant = pairs[:VECTOR_SIZE]
        inverseProduct = product = 1.0
        for token, prob in significant:
            product *= prob
            inverseProduct *= 1.0 - prob
        # Graham scoring - was:
        # return pairs, significant, product/(product + inverseProduct)
        # 'pairs' and 'significant' added to assist data logging,
        # evaluation

        # Robinson scoring - don't know why, but this works great
        n = float(len(significant))  # n could be < VECTOR_SIZE
        # div by zero possible if no headers (and msg has no body)
        try:
            P = 1 - inverseProduct ** (1/n)
            Q = 1 - product ** (1/n)
            S = (1 + (P - Q)/(P + Q))/2
        except:
            S = 0.99
        return pairs, significant, S
[Jim Bublitz]
Since my last msg was incomprehensible,
Not at all -- it's just that nobody knew what parts of it meant <wink>.

I'm just going to attach my code at the bottom and refer to it.
Graham's original score calculation -
product/(product + inverseProduct)
does give the kind of score distribution you described.
That's not a matter of speculation, we observed that endlessly over the first few weeks of this project's life, and the msg you're replying to showed that it's true even when fed random data.
If you substitute Gary Robinson's suggestion (see below - last few lines), the score distribution does spread out to the center a little bit.
It spread out enormously for us -- it was night-and-day difference.
You can get Robinson's scoring calculation (as below) to produce a normal distribution around the mean ham or spam score if you either:
a. Increase VECTOR_SIZE (max_discriminators??) - a value of around 100 seems to do pretty well
We use 150 by default. The distributions look "kinda normal", but tighter than normal toward the endpoints, and looser toward the middle.
b. Instead of selecting the most extreme N word probabilities from the msg being tested, select the words randomly from the list of words in the msg (not shown in code below). You immediately (VECTOR_SIZE = 15) get a normal distribution around the means, but accuracy sucks until you select 75 to 100 words/msg randomly.
Hmm. I'm not *that* much in love with a normal distribution <wink>.
Neither (a) nor (b) works as well as the 15 most extreme words on my test data. Also, Robinson's calculation doesn't produce ham at 0.99 or spam at 0.01 - in fact the msgs that I had a hard time classifying manually are (mostly) the ones that fall near the cutoff.
Right, we call that "the middle ground" here. It's believed to be useful, although everyone seems to ignore it <wink>.
Note also that the code below will produce an unpredictable score if the msg contains only 30 .01 words and 30 .99 words.
That can't happen for us: I believe you're still using Graham's contrived clamping of probabilities into [.01, .99]. We have no limits; if a word appears only in spam, we give it a raw spamprob of 1; if only in ham, a raw spamprob of 0; and then rely on Gary's probability adjustment to soften those extremes in accord with how *many* messages they've been seen in. One nice result is that we don't suffer "cancellation disease" anymore (a factor often implicated in bad errors when we were using Paul's scheme, where up to hundreds each of .01 and .99 words fought to produce a useless result).
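The probability adjustment referred to here is Robinson's f(w), which blends each word's raw ratio with an assumed prior according to how many messages the word has actually been seen in. A sketch, with s (the strength given to the prior) and x (the assumed probability for unseen words) as illustrative parameter names:

```python
def adjusted_spamprob(raw_prob, nseen, s=1.0, x=0.5):
    # Robinson's f(w): with nseen == 0 this is just the prior x; as
    # nseen grows, f(w) approaches the raw per-word ratio, so extreme
    # raw probabilities (0 or 1) are softened unless well attested.
    return (s * x + nseen * raw_prob) / (s + nseen)

# A word seen only in spam, but in just 2 messages, is softened
# well below 1:
adjusted_spamprob(1.0, 2)    # (0.5 + 2)/3, about 0.833
# The same word seen in 200 spams keeps an extreme probability:
adjusted_spamprob(1.0, 200)  # about 0.9975
```

This is why raw spamprobs of exactly 0 and 1 are safe to keep: the adjustment does the clamping that Graham's [.01, .99] limits did, but in proportion to the evidence.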
It depends on how pairs.sort (...) handles ties.
Under Python 2.3 (CVS) the sort is stable; under previous Pythons, it's stable "by (reliable) accident" so long as there are no more than 100 elements in the list; if more than that, it's unpredictable.
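For what it's worth, one way to sidestep the stability question entirely is to break ties explicitly in the sort key; this is a sketch, not the code under discussion:

```python
def most_extreme(pairs, n=15):
    # Sort (token, prob) pairs by distance from 0.5, descending, and
    # break ties on the token itself, so the selection is
    # deterministic under any sort implementation.
    ranked = sorted(pairs, key=lambda tp: (-abs(tp[1] - 0.5), tp[0]))
    return ranked[:n]

pairs = [("ccc", 0.75), ("bbb", 0.01), ("aaa", 0.99)]
most_extreme(pairs, n=2)
```

With the tie broken alphabetically, "aaa" (0.99) and "bbb" (0.01) are picked in the same order no matter how the inputs were listed.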
Making the limits asymmetrical (eg .989 and .01 instead of .99/.01) doesn't seem to work very well.
Dump the limits entirely -- we did, and we're happier <wink>.
The other thing that helps make the scores extreme in actual use is that the distribution of word probabilities is extreme. For my corpora using the code below I get 169378 unique tokens (from 24000 msgs, 50% spam):
Probability Number of Tokens % of Total [0.00, 0.01) 46329 27.4% (never in spam) [0.99, 1.00) 104367 61.7% (never in ham) ----- 89.1%
This gets less problematic if you dump the artificial limits. A word that appears only in spam gets a higher spamprob the more spams it appears in then, and so pushes out other unique-to-one-kind words that haven't appeared as often.
From looking at failures (and assuming passes behave similarly) the 10.9% (~17000 tokens) in between 0.01 and 0.99 still do a lot of the work, which makes sense, since those are the most commonly used words.
We've gotten better results under all schemes by pretending that words with spamprobs in (0.4, 0.6) simply don't exist.
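That filter amounts to discarding every word whose spamprob lands in the open interval (0.4, 0.6) before selecting discriminators; a minimal sketch (the band edges are the ones mentioned above):

```python
def discriminators(pairs, low=0.4, high=0.6):
    # Pretend words with spamprobs strictly inside (low, high)
    # simply don't exist; only the remaining pairs are candidates
    # for the most-extreme selection.
    return [(t, p) for t, p in pairs if p <= low or p >= high]

discriminators([("free", 0.97), ("the", 0.51), ("meeting", 0.03)])
```

Here "the" at 0.51 is dropped, while "free" and "meeting" survive as clues.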
My experience has been that the tail tips of the score distribution maintain about the same distance from the mean score no matter what you do. If you improve the shape of the distribution (make it look more normal), you move the tails about the same distance as the distribution has spread out, and the ham and spam tails overlap more and more, increasing the fp/fn rates. The little testing I did on Spambayes (last week's CVS) seemed to show the same effect.
That jibes with my experience too. "The problem" in my data, though, is that some ham simply have nothing hammish about them, unless viewed in the light of semantic knowledge this kind of scheme doesn't have; likewise the only thing spammish about some of the spam is that people here have decided to call them spam <wink>.
For the code below, if I train on 8000 msgs (50% spam) and then test 200, retrain on those 200, and repeat for 16000 msgs, I get 4 fns (3 are identical msgs from the same sender with different dates, all are Klez msgs) and 1 fp (an ISP msg "Routine Service Maintenance"), which are fn and fp rates of 0.05% and 0.01%. The failures all scored in the range [0.495, 0.511] (cutoff at 0.50)
Whatever you're doing is certainly working well for you!
I ran the SA Corpus
What is this? A SpamAssassin corpus? Nobody has mentioned that here before.
today also and don't get any failures if I train on 8K of my msgs and 50/100 of their msgs (worse results under other conditions), but the sample sizes there are too small to do an adequate training sample and have enough test data to have confidence in the results. I can post those results if anyone is interested.
Graham's method was basically designed to produce extreme scores, and the distribution of words in the data seems to reinforce that.
If it's of any use to anybody (it's certainly beyond me), both the distribution of msg scores and distribution of word probabilities look like exponential or Weibull distributions. (They're "bathtub" curves, if anyone is familiar with reliability statistics).
This is all based on my data, which is not the same as your data. YMMV.
Jim
# classes posted to c.l.p by Erik Max Francis
# algorithm from Paul Graham ("A Plan for Spam")

# was TOKEN_RE = re.compile(r"[a-zA-Z0-9'$_-]+")
# changed to catch Asian charsets
TOKEN_RE = re.compile(r"[\w'$_-]+", re.U)

FREQUENCY_THRESHHOLD = 1  # was 5
GOOD_BIAS = 2.0
BAD_BIAS = 1.0
# changed to improve distribution 'width' because
# of smaller token count in training data
GOOD_PROB = 0.0001  # was 0.01
BAD_PROB = 0.9999   # was 0.99
VECTOR_SIZE = 15
UNKNOWN_PROB = 0.5  # was 0.4 or 0.2

# remove mixed alphanumerics or strictly numeric:
# eg: HM6116, 555N, 1234 (also Windows98, 133t, h4X0r)
pn1_re = re.compile(r"[a-zA-Z]+[0-9]+")
pn2_re = re.compile(r"[0-9]+[a-zA-Z]+")
num_re = re.compile(r"^[0-9]+")


class Corpus(dict):
    # instantiate one training Corpus for spam, one for ham,
    # and then one Corpus for each test msg as msgs are tested
    # (the msg Corpus instance is destroyed after testing the msg)
    def __init__(self, data=None):
        dict.__init__(self)
        self.count = 0
        if data is not None:
            self.process(data)

    # process is used to extract tokens from msg, either in building
    # the training sample or when testing a msg (can process entire
    # msg or one part of msg at a time); 'data' is a string
    def process(self, data):
        tokens = TOKEN_RE.findall(str(data))
        if not len(tokens):
            return
        # added the first 'if' in the loop to reduce
        # total # of tokens by >75%
        deletes = 0
        for token in tokens:
            if (len(token) > 20)\
               or (pn1_re.search(token) != None)\
               or (pn2_re.search(token) != None)\
               or (num_re.search(token) != None):
                deletes += 1
                continue
            if self.has_key(token):
                self[token] += 1
            else:
                self[token] = 1
        # count tokens, not msgs
        self.count += len(tokens) - deletes


class Database(dict):
    def __init__(self, good, bad):
        dict.__init__(self)
        self.build(good, bad)

    # 'build' constructs the dict of token: probability; run once
    # after training from the ham/spam Corpus instances; the ham/spam
    # Corpus instances can be destroyed (after saving?) after 'build'
    # is run
    def build(self, good, bad):
        ngood = good.count
        nbad = bad.count
        # print ngood, nbad, float(nbad)/float(ngood)
        for token in good.keys() + bad.keys():  # doubles up, but works
            if not self.has_key(token):
                g = GOOD_BIAS*good.get(token, 0)
                b = BAD_BIAS*bad.get(token, 0)
                if g + b >= FREQUENCY_THRESHHOLD:
                    # the 'min's are leftovers from counting
                    # msgs instead of tokens for ngood, nbad
                    goodMetric = min(1.0, g/ngood)
                    badMetric = min(1.0, b/nbad)
                    total = goodMetric + badMetric
                    prob = max(GOOD_PROB,
                               min(BAD_PROB, badMetric/total))
                    self[token] = prob

    def scan(self, corpus):
        pairs = [(token, self.get(token, UNKNOWN_PROB))
                 for token in corpus.keys()]
        pairs.sort(lambda x, y: cmp(abs(y[1] - 0.5), abs(x[1] - 0.5)))
        significant = pairs[:VECTOR_SIZE]
        inverseProduct = product = 1.0
        for token, prob in significant:
            product *= prob
            inverseProduct *= 1.0 - prob
        # Graham scoring - was:
        # return pairs, significant, product/(product + inverseProduct)
        # 'pairs' and 'significant' added to assist data logging,
        # evaluation

        # Robinson scoring - don't know why, but this works great
        n = float(len(significant))  # n could be < VECTOR_SIZE
        # div by zero possible if no headers (and msg has no body)
        try:
            P = 1 - inverseProduct ** (1/n)
            Q = 1 - product ** (1/n)
            S = (1 + (P - Q)/(P + Q))/2
        except:
            S = 0.99
        return pairs, significant, S
Oops! Sorry about that. This was an ancient partial reply that appears to have escaped from my Drafts folder; I'm not sure how, but I didn't intend to send it.
[Jim Bublitz]
Since my last msg was incomprehensible,
Not at all -- it's just that nobody knew what parts of it meant <wink>. ...
On 26-Apr-03 Tim Peters wrote:
Oops! Sorry about that. This was an ancient partial reply that appears to have escaped from my Drafts folder; I'm not sure how, but I didn't intend to send it.
No problem. Seeing [Spambayes] in the subj line just made me think my mail system had gone totally berserk (since I unsubscribed quite a while ago). My spam filter, after working flawlessly until mid-Feb, did break (word db got wiped out ???) and it isn't retraining itself very well/very quickly. About 10% spam is getting through - pretty awful. I was hoping it would converge to its original performance, but it looks like I'm going to have to fix it one of these days. It does make me wonder if the original performance was a fluke (or if something is still broken). Jim
[Tim]
... Intuitively, it *seems* like it would be good to get something not so insanely sensitive to random input as Paul-combining, but more sensitive to overwhelming amounts of evidence than Gary-combining.
So there's a new option,

[Classifier]
use_tim_combining: True

The comments (from Options.py) explain it:

# For the default scheme, use "tim-combining" of probabilities.  This
# has no effect under the central-limit schemes.  Tim-combining is a
# kind of cross between Paul Graham's and Gary Robinson's combining
# schemes.  Unlike Paul's, it's never crazy-certain, and compared to
# Gary's, in Tim's tests it greatly increased the spread between mean
# ham-scores and spam-scores, while simultaneously decreasing the
# variance of both.  Tim needed a higher spam_cutoff value for best
# results, but spam_cutoff is less touchy than under Gary-combining.
use_tim_combining: False

"Tim combining" simply takes the geometric mean of the spamprobs as a measure of spamminess S, and the geometric mean of 1-spamprob as a measure of hamminess H, then returns S/(S+H) as "the score". This is well-behaved when fed random, uniformly distributed probabilities, but isn't reluctant to let an overwhelming number of extreme clues lead it to an extreme conclusion (although you're not going to see it give Graham-like 1e-30 or 1.0000000000000 scores).

Don't use a central-limit scheme with this (it has no effect on those). If you test it, use whatever variations on the "all default" scheme you usually use, but it will probably help to boost spam_cutoff. Note that the default max_discriminators is still 150, and that's what I used below.
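As described, tim-combining is straightforward to write down. A sketch (my own rendering of the S/(S+H) rule above, not spambayes' code; it assumes every prob is strictly inside (0, 1), since the geometric means are computed through logs):

```python
import math

def tim_combine(probs):
    # Geometric mean of the spamprobs (spamminess S) and of their
    # complements (hamminess H); the score is S/(S+H).
    n = len(probs)
    S = math.exp(sum(math.log(p) for p in probs) / n)
    H = math.exp(sum(math.log(1.0 - p) for p in probs) / n)
    return S / (S + H)
```

Because S and H are geometric means rather than raw products, a handful of extreme clues can't drive the score to exactly 0 or 1, but a large majority of extreme clues still pulls it well away from 0.5.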
Here's a 10-set cross-validation run on my data, restricted to 100 ham and 100 spam per set, with all defaults, except

                   before  after
                   ------  -----
use_tim_combining  False   True
spam_cutoff        0.55    0.615

-> <stat> tested 100 hams & 100 spams against 900 hams & 900 spams
[ditto 19 times]

false positive percentages
    0.000  0.000  tied
    1.000  0.000  won  -100.00%
    1.000  1.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied

won   1 times
tied  9 times
lost  0 times

total unique fp went from 2 to 1  won  -50.00%
mean fp % went from 0.2 to 0.1  won  -50.00%

false negative percentages
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    1.000  1.000  tied
    0.000  0.000  tied

won   0 times
tied 10 times
lost  0 times

total unique fn went from 1 to 1  tied
mean fn % went from 0.1 to 0.1  tied

The real story here is in the score distributions; contrary to what the comment said above, the ham-score variance increased with this little data:

ham mean               ham sdev
30.63 18.80 -38.62%    6.03 6.83 +13.27%
29.31 17.35 -40.81%    5.48 6.84 +24.82%
29.96 18.50 -38.25%    6.95 9.02 +29.78%
29.66 18.12 -38.91%    5.89 6.81 +15.62%
29.51 17.34 -41.24%    5.73 6.71 +17.10%
29.40 17.43 -40.71%    5.73 6.61 +15.36%
29.75 17.74 -40.37%    5.76 6.96 +20.83%
29.71 18.17 -38.84%    5.97 6.48  +8.54%
31.98 20.41 -36.18%    5.96 8.02 +34.56%
29.83 18.11 -39.29%    4.75 5.41 +13.89%

ham mean and sdev for all runs
29.97 18.20 -39.27%    5.90 7.08 +20.00%

spam mean              spam sdev
79.23 88.38 +11.55%    6.96 5.52 -20.69%
79.40 88.70 +11.71%    7.00 5.64 -19.43%
78.68 88.06 +11.92%    6.69 5.13 -23.32%
79.65 89.01 +11.75%    7.20 5.22 -27.50%
79.91 88.87 +11.21%    6.35 4.67 -26.46%
80.47 89.16 +10.80%    7.22 6.06 -16.07%
80.94 89.78 +10.92%    6.60 4.45 -32.58%
80.30 89.41 +11.34%    6.95 5.49 -21.01%
78.54 87.70 +11.66%    7.30 6.45 -11.64%
80.06 89.06 +11.24%    6.98 5.43 -22.21%

spam mean and sdev for all runs
79.72 88.81 +11.40%    6.97 5.47 -21.52%

ham/spam mean difference: 49.75 70.61 +20.86

So before, the score equidistant from both means was 52.78, at 3.87 sdevs from each; after, it was 58.03, at 5.63 sdevs from each. The populations are much better separated by this measure.

Histograms before:

-> <stat> Ham scores for all runs: 1000 items; mean 29.97; sdev 5.90
-> <stat> min 13.521; median 29.6919; max 60.8937
* = 2 items
...
13   2 *
14   0
15   2 *
16   8 ****
17   4 **
18   9 *****
19  17 *********
20  14 *******
21  16 ********
22  24 ************
23  38 *******************
24  47 ************************
25  62 *******************************
26  65 *********************************
27  69 ***********************************
28  73 *************************************
29  70 ***********************************
30  76 **************************************
31  70 ***********************************
32  61 *******************************
33  51 **************************
34  50 *************************
35  34 *****************
36  30 ***************
37  27 **************
38  18 *********
39  12 ******
40  11 ******
41  13 *******
42   2 *
43   5 ***
44   8 ****
45   2 *
46   1 *
47   3 **
48   1 *
49   0
50   3 **
51   0
52   0
53   0
54   0
55   1 *
56   0
57   0
58   0
59   0
60   1 *
...

-> <stat> Spam scores for all runs: 1000 items; mean 79.72; sdev 6.97
-> <stat> min 52.3428; median 79.9799; max 98.1879
* = 2 items
...
52   1 *
53   0
54   0
55   0
56   3 **
57   1 *
58   0
59   1 *
60   4 **
61   4 **
62   4 **
63   3 **
64   4 **
65   7 ****
66   9 *****
67  10 *****
68  13 *******
69  16 ********
70  26 *************
71  18 *********
72  29 ***************
73  35 ******************
74  40 ********************
75  39 ********************
76  56 ****************************
77  52 **************************
78  50 *************************
79  76 **************************************
80  60 ******************************
81  77 ***************************************
82  45 ***********************
83  61 *******************************
84  50 *************************
85  43 **********************
86  41 *********************
87  33 *****************
88  19 **********
89  11 ******
90  11 ******
91   8 ****
92   2 *
93   9 *****
94   4 **
95   9 *****
96   2 *
97  11 ******
98   3 **
99   0

Histograms after:

-> <stat> Ham scores for all runs: 1000 items; mean 18.20; sdev 7.08
-> <stat> min 5.6946; median 17.1757; max 73.1302
* = 2 items
...
5    1 *
6   13 *******
7   16 ********
8   25 *************
9   22 ***********
10  37 *******************
11  45 ***********************
12  56 ****************************
13  70 ***********************************
14  61 *******************************
15  66 *********************************
16  79 ****************************************
17  63 ********************************
18  59 ******************************
19  59 ******************************
20  56 ****************************
21  47 ************************
22  36 ******************
23  37 *******************
24  32 ****************
25   9 *****
26  20 **********
27  17 *********
28   8 ****
29   7 ****
30  11 ******
31   6 ***
32   7 ****
33   5 ***
34   4 **
35   2 *
36   2 *
37   6 ***
38   1 *
39   0
40   3 **
41   3 **
42   0
43   1 *
44   1 *
45   1 *
46   0
47   1 *
48   0
49   0
50   2 *
51   1 *
52   0
53   0
54   0
55   0
56   0
57   0
58   0
59   0
60   0
61   1 *
62   0
63   0
64   0
65   0
66   0
67   0
68   0
69   0
70   0
71   0
72   0
73   1 *

-> <stat> Spam scores for all runs: 1000 items; mean 88.81; sdev 5.47
-> <stat> min 54.9382; median 89.5188; max 98.3805
* = 2 items
...
54   1 *
55   0
56   0
57   0
58   0
59   0
60   0
61   0
62   0
63   1 *
64   3 **
65   0
66   1 *
67   0
68   2 *
69   2 *
70   3 **
71   3 **
72   2 *
73   2 *
74   4 **
75   4 **
76   6 ***
77   8 ****
78   8 ****
79   6 ***
80  12 ******
81  25 *************
82  26 *************
83  25 *************
84  39 ********************
85  58 *****************************
86  70 ***********************************
87  64 ********************************
88  74 *************************************
89 106 *****************************************************
90  85 *******************************************
91  62 *******************************
92  86 *******************************************
93  79 ****************************************
94  37 *******************
95  23 ************
96  42 *********************
97  25 *************
98   6 ***
99   0

There are snaky tails in either case, but "the middle ground" here is larger, sparser, and still contains the errors.

Across my full test data, which I actually ran first, you can ignore the "won/lost" business; I had spam_cutoff at 0.55 for both runs, and the overall results would have been virtually identical had I boosted spam_cutoff in the second run (recall that I can't demonstrate an improvement on this data anymore! I can only determine whether something is a disaster, and this ain't).

-> <stat> tested 2000 hams & 1400 spams against 18000 hams & 12600 spams
[ditto 19 times]
...
false positive percentages 0.000 0.050 lost +(was 0) 0.000 0.050 lost +(was 0) 0.000 0.050 lost +(was 0) 0.000 0.000 tied 0.050 0.100 lost +100.00% 0.000 0.000 tied 0.000 0.000 tied 0.000 0.000 tied 0.000 0.000 tied 0.050 0.050 tied won 0 times tied 6 times lost 4 times total unique fp went from 2 to 6 lost +200.00% mean fp % went from 0.01 to 0.03 lost +200.00% false negative percentages 0.000 0.000 tied 0.071 0.071 tied 0.000 0.000 tied 0.071 0.071 tied 0.143 0.071 won -50.35% 0.143 0.000 won -100.00% 0.143 0.143 tied 0.143 0.000 won -100.00% 0.071 0.000 won -100.00% 0.000 0.000 tied won 4 times tied 6 times lost 0 times total unique fn went from 11 to 5 won -54.55% mean fn % went from 0.0785714285714 to 0.0357142857143 won -54.55% ham mean ham sdev 25.65 10.68 -58.36% 5.67 5.44 -4.06% 25.61 10.68 -58.30% 5.50 5.29 -3.82% 25.57 10.68 -58.23% 5.67 5.49 -3.17% 25.66 10.71 -58.26% 5.54 5.27 -4.87% 25.42 10.55 -58.50% 5.72 5.71 -0.17% 25.51 10.43 -59.11% 5.39 5.11 -5.19% 25.65 10.40 -59.45% 5.59 5.29 -5.37% 25.61 10.51 -58.96% 5.41 5.21 -3.70% 25.84 10.80 -58.20% 5.48 5.30 -3.28% 25.81 10.85 -57.96% 5.81 5.73 -1.38% ham mean and sdev for all runs 25.63 10.63 -58.53% 5.58 5.39 -3.41% spam mean spam sdev 83.86 93.17 +11.10% 7.09 4.55 -35.83% 83.64 93.16 +11.38% 6.83 4.52 -33.82% 83.27 92.91 +11.58% 6.81 4.52 -33.63% 83.82 93.14 +11.12% 6.88 4.67 -32.12% 83.89 93.29 +11.21% 6.65 4.56 -31.43% 83.78 93.11 +11.14% 6.96 4.72 -32.18% 83.42 93.00 +11.48% 6.82 4.74 -30.50% 83.86 93.29 +11.24% 6.71 4.55 -32.19% 83.88 93.22 +11.13% 6.98 4.71 -32.52% 83.75 93.28 +11.38% 6.65 4.32 -35.04% spam mean and sdev for all runs 83.72 93.16 +11.28% 6.84 4.59 -32.89% ham/spam mean difference: 58.09 82.53 +24.44 So the equidistant score changed from 51.73 at 4.68 sdevs from each mean, to 55.20 at 8.27 sdevs from each. That's big. 
The "after" histograms had 200 buckets in this run: -> <stat> Ham scores for all runs: 20000 items; mean 10.63; sdev 5.39 -> <stat> min 0.281945; median 9.69929; max 81.9673 * = 17 items 0.0 7 * 0.5 13 * 1.0 21 ** 1.5 41 *** 2.0 86 ****** 2.5 166 ********** 3.0 239 *************** 3.5 326 ******************** 4.0 466 **************************** 4.5 554 ********************************* 5.0 642 ************************************** 5.5 701 ****************************************** 6.0 793 *********************************************** 6.5 804 ************************************************ 7.0 933 ******************************************************* 7.5 972 ********************************************************** 8.0 997 *********************************************************** 8.5 934 ******************************************************* 9.0 947 ******************************************************** 9.5 939 ******************************************************** 10.0 839 ************************************************** 10.5 786 *********************************************** 11.0 752 ********************************************* 11.5 760 ********************************************* 12.0 636 ************************************** 12.5 606 ************************************ 13.0 554 ********************************* 13.5 483 ***************************** 14.0 461 **************************** 14.5 399 ************************ 15.0 360 ********************** 15.5 317 ******************* 16.0 275 ***************** 16.5 224 ************** 17.0 193 ************ 17.5 169 ********** 18.0 172 *********** 18.5 154 ********** 19.0 153 ********* 19.5 92 ****** 20.0 104 ******* 20.5 99 ****** 21.0 74 ***** 21.5 73 ***** 22.0 73 ***** 22.5 50 *** 23.0 38 *** 23.5 50 *** 24.0 38 *** 24.5 34 ** 25.0 26 ** 25.5 39 *** 26.0 24 ** 26.5 34 ** 27.0 18 ** 27.5 15 * 28.0 20 ** 28.5 15 * 29.0 14 * 29.5 15 * 30.0 12 * 30.5 15 * 31.0 14 * 31.5 10 * 32.0 12 * 32.5 6 * 
33.0 10 * 33.5 4 * 34.0 8 * 34.5 5 * 35.0 5 * 35.5 6 * 36.0 7 * 36.5 4 * 37.0 2 * 37.5 3 * 38.0 1 * 38.5 4 * 39.0 6 * 39.5 2 * 40.0 2 * 40.5 5 * 41.0 0 41.5 2 * 42.0 3 * 42.5 3 * 43.0 1 * 43.5 2 * 44.0 1 * 44.5 2 * 45.0 1 * 45.5 1 * 46.0 2 * 46.5 0 47.0 3 * 47.5 0 48.0 1 * 48.5 1 * 49.0 1 * 49.5 0 50.0 1 * 50.5 0 51.0 2 * 51.5 0 52.0 1 * 52.5 0 53.0 0 53.5 1 * 54.0 1 * 54.5 2 * 55.0 0 55.5 0 56.0 1 * 56.5 1 * 57.0 0 57.5 0 58.0 0 58.5 1 * 59.0 0 59.5 0 60.0 0 60.5 0 61.0 1 * 61.5 0 62.0 0 62.5 0 63.0 0 63.5 0 64.0 0 64.5 0 65.0 0 65.5 0 66.0 0 66.5 0 67.0 0 67.5 0 68.0 0 68.5 0 69.0 0 69.5 0 70.0 1 * the lady with the long & obnoxious employer-generated sig 70.5 0 71.0 0 71.5 0 72.0 0 72.5 0 73.0 0 73.5 0 74.0 0 74.5 0 75.0 0 75.5 0 76.0 0 76.5 0 77.0 0 77.5 0 78.0 0 78.5 0 79.0 0 79.5 0 80.0 0 80.5 0 81.0 0 81.5 1 * the verbatim quote of a long Nigerian-scam spam ... -> <stat> Spam scores for all runs: 14000 items; mean 93.16; sdev 4.59 -> <stat> min 24.3497; median 93.8141; max 99.6769 * = 15 items ... 
24.0 1 * not really sure -- it's a giant base64-encoded plain text file 24.5 0 25.0 0 25.5 0 26.0 0 26.5 0 27.0 0 27.5 0 28.0 0 28.5 0 29.0 1 * the spam with the uuencoded body we throw away 29.5 0 30.0 0 30.5 0 31.0 0 31.5 0 32.0 0 32.5 0 33.0 0 33.5 0 34.0 0 34.5 0 35.0 0 35.5 0 36.0 0 36.5 0 37.0 0 37.5 0 38.0 0 38.5 0 39.0 0 39.5 0 40.0 0 40.5 0 41.0 0 41.5 0 42.0 0 42.5 0 43.0 0 43.5 0 44.0 0 44.5 0 45.0 0 45.5 0 46.0 1 * Hello, my Name is BlackIntrepid 46.5 0 47.0 0 47.5 0 48.0 0 48.5 0 49.0 0 49.5 0 50.0 0 50.5 0 51.0 0 51.5 0 52.0 0 52.5 0 53.0 0 53.5 1 * unclear; a collection of webmaster links 54.0 1 * Susan makes a propsal (sic) to Tim 54.5 0 55.0 1 * 55.5 0 56.0 0 56.5 1 * 57.0 2 * 57.5 0 58.0 0 58.5 1 * 59.0 0 59.5 0 60.0 1 * 60.5 2 * 61.0 1 * 61.5 1 * 62.0 0 62.5 1 * 63.0 1 * 63.5 0 64.0 1 * 64.5 1 * 65.0 0 65.5 1 * 66.0 1 * 66.5 2 * 67.0 4 * 67.5 2 * 68.0 0 68.5 1 * 69.0 0 69.5 3 * 70.0 1 * 70.5 5 * 71.0 5 * 71.5 3 * 72.0 4 * 72.5 3 * 73.0 3 * 73.5 6 * 74.0 3 * 74.5 4 * 75.0 8 * 75.5 8 * 76.0 10 * 76.5 10 * 77.0 10 * 77.5 17 ** 78.0 14 * 78.5 27 ** 79.0 16 ** 79.5 23 ** 80.0 28 ** 80.5 29 ** 81.0 37 *** 81.5 37 *** 82.0 46 **** 82.5 55 **** 83.0 47 **** 83.5 53 **** 84.0 58 **** 84.5 68 ***** 85.0 86 ****** 85.5 118 ******** 86.0 135 ********* 86.5 159 *********** 87.0 165 *********** 87.5 178 ************ 88.0 209 ************** 88.5 231 **************** 89.0 299 ******************** 89.5 391 *************************** 90.0 425 ***************************** 90.5 402 *************************** 91.0 501 ********************************** 91.5 582 *************************************** 92.0 636 ******************************************* 92.5 667 ********************************************* 93.0 713 ************************************************ 93.5 685 ********************************************** 94.0 610 ***************************************** 94.5 621 ****************************************** 95.0 721 
************************************************* 95.5 735 ************************************************* 96.0 870 ********************************************************** 96.5 742 ************************************************** 97.0 449 ****************************** 97.5 447 ****************************** 98.0 556 ************************************** 98.5 561 ************************************** 99.0 264 ****************** 99.5 171 ************ The mistakes are all familiar; the good news is that "the normal cases" are far removed from what might plausibly be called a middle ground. For example, if we called the region from 40 thru 70 here "the middle ground", and kicked those out for manual review, there would be very few msgs to review, but they would contain almost all the mistakes. How does this do on your data? I'm in favor what works <wink>.
The thing about the geometric mean is that it is much more sensitive to numbers near 0, so the S/(S+H) technique is biased in that way.

If you want to try something like that, I would suggest using the ARITHMETIC means in computing S and H and again using S(S+H). That would remove that bias. It wouldn't be invoking that optimality theorem, but whatever works...

It really seems, as a matter of being educated, that the arithmetic approach is worth trying if it doesn't take a lot of trouble to try it.
"but more sensitive to overwhelming amounts of evidence than Gary-combining"
From the email you sent at 1:02PM yesterday:
0.40    0
0.45    2 *
0.50  412 *********
0.55 3068 *************************************************************
0.60 1447 *****************************
0.65   71 **
0.70    0

One thing I'd like to be more clear on. If I understand the experiment correctly, you set 10 to .99 and 40 were random. What percentage actually ended up as > .5, without regard to HOW MUCH over .5?
It's hard to know what to make of this, especially in light of the claim that Gary-combining has been proven to be the most sensitive possible test for rejecting the hypothesis that a collection of probs is uniformly distributed.
It's not the (S-H)/(S+H) that is the most sensitive (under certain conditions); it's that the geometric mean approach for computing S gives a result that is MONOTONIC WITH a calculation which is the most sensitive. The real technique would take S and feed it into an inverse chi-square function with (in this experiment) 100 degrees of freedom. The output (roughly speaking) would be the probability that that S (or a more extreme one) might have occurred by chance alone. Call these numbers S' and H' for S and H respectively.

The calculation (S-H)/(S+H) will be > 0 if and only if (S'-H')/(S'+H') is > 0 (unless I've made some error). So, as a binary indicator, the two are equivalent. However, if you used S' and H', you would see something more like real probabilities, which would probably be of magnitudes more attractive to you. You could probably use a table to approximate the inverse chi-square calc rather than actually doing the computations all the time.

I didn't suggest doing that, at first, because I was interested in providing a binary indicator and wanted to keep things simple -- and from the POV of a binary indicator, it doesn't make any difference. So, if it happens that you feel like taking the time to go "all the way" with this approach, I would suggest actually computing S' and H' and seeing what happens. I think you would like the results better -- I just didn't suggest it at first because I didn't know the spread would be of such interest and I wanted to keep things simple.

I think this would work better than the S/(S+H) approach, because if you use geometric means, it's more sensitive to one condition than the other, and if you use arithmetic means, you don't invoke the optimality theorem. Of course, this is ALL speculative. But the probabilities involved will DEFINITELY be of greater magnitude, and so give a better-defined spread, if the inverse chi-square is used.
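[Editor's note: for concreteness, here is one guess at what "going all the way" might look like. This is a hedged sketch, not anyone's actual code: `chi2Q` is the standard closed-form survival function of the chi-square distribution for even degrees of freedom, and the sign conventions in `prime_score` (high score = spammy) are my assumption about how S' and H' would be oriented.]

```python
import math

def chi2Q(x2, v):
    # P(chisq >= x2) for v (even) degrees of freedom: the exact
    # Poisson-tail series exp(-m) * sum(m**i / i!), m = x2/2.
    assert v % 2 == 0
    m = x2 / 2.0
    term = total = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def prime_score(probs):
    # Fisher's method: under the null that the probs are uniform,
    # -2 * sum(ln p_i) is chi-square with 2n degrees of freedom.
    # S' is near 1 when the spamprobs are too large to be uniform
    # noise; H' plays the same role for the 1-p complements.
    n = len(probs)
    S_prime = chi2Q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    H_prime = chi2Q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    # Map (S'-H')/(S'+H') from [-1, 1] onto [0, 1] as a final score.
    return (1.0 + (S_prime - H_prime) / (S_prime + H_prime)) / 2.0
```

On a balanced clue list (all 0.5) this returns exactly 0.5; on uniformly extreme clues it moves decisively toward 0 or 1.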
--Gary

--
Gary Robinson
CEO
Transpose, LLC
grobinson@transpose.com
207-942-3463
http://www.emergentmusic.com
http://radio.weblogs.com/0101454
From: Tim Peters <tim.one@comcast.net>
Date: Wed, 09 Oct 2002 20:34:15 -0400
To: SpamBayes <spambayes@python.org>
Cc: Gary Robinson <grobinson@transpose.com>
Subject: RE: [Spambayes] spamprob combining
[Tim]
... Intuitively, it *seems* like it would be good to get something not so insanely sensitive to random input as Paul-combining, but more sensitive to overwhelming amounts of evidence than Gary-combining.
So there's a new option,
[Classifier] use_tim_combining: True
The comments (from Options.py) explain it:
# For the default scheme, use "tim-combining" of probabilities.  This
# has no effect under the central-limit schemes.  Tim-combining is a
# kind of cross between Paul Graham's and Gary Robinson's combining
# schemes.  Unlike Paul's, it's never crazy-certain, and compared to
# Gary's, in Tim's tests it greatly increased the spread between mean
# ham-scores and spam-scores, while simultaneously decreasing the
# variance of both.  Tim needed a higher spam_cutoff value for best
# results, but spam_cutoff is less touchy than under Gary-combining.
use_tim_combining: False
"Tim combining" simply takes the geometric mean of the spamprobs as a measure of spamminess S, and the geometric mean of 1-spamprob as a measure of hamminess H, then returns S/(S+H) as "the score". This is well-behaved when fed random, uniformly distributed probabilities, but isn't reluctant to let an overwhelming number of extreme clues lead it to an extreme conclusion (although you're not going to see it give Graham-like 1e-30 or 1.0000000000000 scores).
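[Editor's note: the scheme as just described fits in a few lines. This is an illustrative sketch, not the actual SpamBayes implementation; the log-space trick is an assumption about how one would avoid underflow with up to 150 clues.]

```python
import math

def tim_combine(probs):
    # S = geometric mean of the spamprobs, H = geometric mean of the
    # complements; the score is S/(S+H).  Working in log space keeps
    # a product of 150 tiny factors from underflowing to 0.0.
    n = len(probs)
    S = math.exp(sum(math.log(p) for p in probs) / n)
    H = math.exp(sum(math.log(1.0 - p) for p in probs) / n)
    return S / (S + H)
```

Note that a clue list of identical probabilities p scores exactly p, so even unanimous 0.99 clues never yield a Graham-like 1.0000000000000; and complementing every clue reflects the score around 0.5.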
Don't use a central-limit scheme with this (it has no effect on those). If you test it, use whatever variations on the "all default" scheme you usually use, but it will probably help to boost spam_cutoff. Note that the default max_discriminators is still 150, and that's what I used below.
Here's a 10-set cross-validation run on my data, restricted to 100 ham and 100 spam per set, with all defaults, except
                   before  after
                   ------  -----
use_tim_combining  False   True
spam_cutoff        0.55    0.615
-> <stat> tested 100 hams & 100 spams against 900 hams & 900 spams [ditto 19 times]
false positive percentages
    0.000  0.000  tied
    1.000  0.000  won   -100.00%
    1.000  1.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
won 1 times tied 9 times lost 0 times
total unique fp went from 2 to 1  won  -50.00%
mean fp % went from 0.2 to 0.1  won  -50.00%
false negative percentages
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    1.000  1.000  tied
    0.000  0.000  tied
won 0 times tied 10 times lost 0 times
total unique fn went from 1 to 1  tied
mean fn % went from 0.1 to 0.1  tied
The real story here is in the score distributions; contrary to what the comment said above, the ham-score variance increased with this little data:
ham mean                      ham sdev
30.63  18.80  -38.62%         6.03  6.83  +13.27%
29.31  17.35  -40.81%         5.48  6.84  +24.82%
29.96  18.50  -38.25%         6.95  9.02  +29.78%
29.66  18.12  -38.91%         5.89  6.81  +15.62%
29.51  17.34  -41.24%         5.73  6.71  +17.10%
29.40  17.43  -40.71%         5.73  6.61  +15.36%
29.75  17.74  -40.37%         5.76  6.96  +20.83%
29.71  18.17  -38.84%         5.97  6.48   +8.54%
31.98  20.41  -36.18%         5.96  8.02  +34.56%
29.83  18.11  -39.29%         4.75  5.41  +13.89%
ham mean and sdev for all runs 29.97 18.20 -39.27% 5.90 7.08 +20.00%
spam mean                     spam sdev
79.23  88.38  +11.55%         6.96  5.52  -20.69%
79.40  88.70  +11.71%         7.00  5.64  -19.43%
78.68  88.06  +11.92%         6.69  5.13  -23.32%
79.65  89.01  +11.75%         7.20  5.22  -27.50%
79.91  88.87  +11.21%         6.35  4.67  -26.46%
80.47  89.16  +10.80%         7.22  6.06  -16.07%
80.94  89.78  +10.92%         6.60  4.45  -32.58%
80.30  89.41  +11.34%         6.95  5.49  -21.01%
78.54  87.70  +11.66%         7.30  6.45  -11.64%
80.06  89.06  +11.24%         6.98  5.43  -22.21%
spam mean and sdev for all runs 79.72 88.81 +11.40% 6.97 5.47 -21.52%
ham/spam mean difference: 49.75 70.61 +20.86
So before, the score equidistant from both means was 52.78, at 3.87 sdevs from each; after, it was 58.03, at 5.63 sdevs from each. The populations are much better separated by this measure.
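[Editor's note: the "equidistant score" is the point lying the same number of sdevs above the ham mean as below the spam mean. A small illustrative helper (my own, not test-harness code) reproduces the figures quoted above:]

```python
def equidistant(ham_mean, ham_sdev, spam_mean, spam_sdev):
    # Solve (x - ham_mean)/ham_sdev == (spam_mean - x)/spam_sdev for x,
    # and report x together with its distance in sdevs from each mean.
    x = ((ham_mean * spam_sdev + spam_mean * ham_sdev)
         / (ham_sdev + spam_sdev))
    return x, (x - ham_mean) / ham_sdev

before = equidistant(29.97, 5.90, 79.72, 6.97)  # ~ (52.78, 3.87)
after = equidistant(18.20, 7.08, 88.81, 5.47)   # ~ (58.03, 5.63)
```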
Histograms before:
-> <stat> Ham scores for all runs: 1000 items; mean 29.97; sdev 5.90 -> <stat> min 13.521; median 29.6919; max 60.8937 * = 2 items ... 13 2 * 14 0 15 2 * 16 8 **** 17 4 ** 18 9 ***** 19 17 ********* 20 14 ******* 21 16 ******** 22 24 ************ 23 38 ******************* 24 47 ************************ 25 62 ******************************* 26 65 ********************************* 27 69 *********************************** 28 73 ************************************* 29 70 *********************************** 30 76 ************************************** 31 70 *********************************** 32 61 ******************************* 33 51 ************************** 34 50 ************************* 35 34 ***************** 36 30 *************** 37 27 ************** 38 18 ********* 39 12 ****** 40 11 ****** 41 13 ******* 42 2 * 43 5 *** 44 8 **** 45 2 * 46 1 * 47 3 ** 48 1 * 49 0 50 3 ** 51 0 52 0 53 0 54 0 55 1 * 56 0 57 0 58 0 59 0 60 1 * ...
-> <stat> Spam scores for all runs: 1000 items; mean 79.72; sdev 6.97 -> <stat> min 52.3428; median 79.9799; max 98.1879 * = 2 items ... 52 1 * 53 0 54 0 55 0 56 3 ** 57 1 * 58 0 59 1 * 60 4 ** 61 4 ** 62 4 ** 63 3 ** 64 4 ** 65 7 **** 66 9 ***** 67 10 ***** 68 13 ******* 69 16 ******** 70 26 ************* 71 18 ********* 72 29 *************** 73 35 ****************** 74 40 ******************** 75 39 ******************** 76 56 **************************** 77 52 ************************** 78 50 ************************* 79 76 ************************************** 80 60 ****************************** 81 77 *************************************** 82 45 *********************** 83 61 ******************************* 84 50 ************************* 85 43 ********************** 86 41 ********************* 87 33 ***************** 88 19 ********** 89 11 ****** 90 11 ****** 91 8 **** 92 2 * 93 9 ***** 94 4 ** 95 9 ***** 96 2 * 97 11 ****** 98 3 ** 99 0
Histograms after:
-> <stat> Ham scores for all runs: 1000 items; mean 18.20; sdev 7.08 -> <stat> min 5.6946; median 17.1757; max 73.1302 * = 2 items ... 5 1 * 6 13 ******* 7 16 ******** 8 25 ************* 9 22 *********** 10 37 ******************* 11 45 *********************** 12 56 **************************** 13 70 *********************************** 14 61 ******************************* 15 66 ********************************* 16 79 **************************************** 17 63 ******************************** 18 59 ****************************** 19 59 ****************************** 20 56 **************************** 21 47 ************************ 22 36 ****************** 23 37 ******************* 24 32 **************** 25 9 ***** 26 20 ********** 27 17 ********* 28 8 **** 29 7 **** 30 11 ****** 31 6 *** 32 7 **** 33 5 *** 34 4 ** 35 2 * 36 2 * 37 6 *** 38 1 * 39 0 40 3 ** 41 3 ** 42 0 43 1 * 44 1 * 45 1 * 46 0 47 1 * 48 0 49 0 50 2 * 51 1 * 52 0 53 0 54 0 55 0 56 0 57 0 58 0 59 0 60 0 61 1 * 62 0 63 0 64 0 65 0 66 0 67 0 68 0 69 0 70 0 71 0 72 0 73 1 *
-> <stat> Spam scores for all runs: 1000 items; mean 88.81; sdev 5.47 -> <stat> min 54.9382; median 89.5188; max 98.3805 * = 2 items ... 54 1 * 55 0 56 0 57 0 58 0 59 0 60 0 61 0 62 0 63 1 * 64 3 ** 65 0 66 1 * 67 0 68 2 * 69 2 * 70 3 ** 71 3 ** 72 2 * 73 2 * 74 4 ** 75 4 ** 76 6 *** 77 8 **** 78 8 **** 79 6 *** 80 12 ****** 81 25 ************* 82 26 ************* 83 25 ************* 84 39 ******************** 85 58 ***************************** 86 70 *********************************** 87 64 ******************************** 88 74 ************************************* 89 106 ***************************************************** 90 85 ******************************************* 91 62 ******************************* 92 86 ******************************************* 93 79 **************************************** 94 37 ******************* 95 23 ************ 96 42 ********************* 97 25 ************* 98 6 *** 99 0
There are snaky tails in either case, but "the middle ground" here is larger, sparser, and still contains the errors.
Across my full test data, which I actually ran first, you can ignore the "won/lost" business; I had spam_cutoff at 0.55 for both runs, and the overall results would have been virtually identical had I boosted spam_cutoff in the second run (recall that I can't demonstrate an improvement on this data anymore! I can only determine whether something is a disaster, and this ain't).
-> <stat> tested 2000 hams & 1400 spams against 18000 hams & 12600 spams
[ditto 19 times]
...

false positive percentages
    0.000  0.050  lost  +(was 0)
    0.000  0.050  lost  +(was 0)
    0.000  0.050  lost  +(was 0)
    0.000  0.000  tied
    0.050  0.100  lost  +100.00%
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.050  0.050  tied
won 0 times tied 6 times lost 4 times
total unique fp went from 2 to 6  lost  +200.00%
mean fp % went from 0.01 to 0.03  lost  +200.00%
false negative percentages
    0.000  0.000  tied
    0.071  0.071  tied
    0.000  0.000  tied
    0.071  0.071  tied
    0.143  0.071  won   -50.35%
    0.143  0.000  won  -100.00%
    0.143  0.143  tied
    0.143  0.000  won  -100.00%
    0.071  0.000  won  -100.00%
    0.000  0.000  tied
won 4 times tied 6 times lost 0 times
total unique fn went from 11 to 5  won  -54.55%
mean fn % went from 0.0785714285714 to 0.0357142857143  won  -54.55%
ham mean                      ham sdev
25.65  10.68  -58.36%         5.67  5.44  -4.06%
25.61  10.68  -58.30%         5.50  5.29  -3.82%
25.57  10.68  -58.23%         5.67  5.49  -3.17%
25.66  10.71  -58.26%         5.54  5.27  -4.87%
25.42  10.55  -58.50%         5.72  5.71  -0.17%
25.51  10.43  -59.11%         5.39  5.11  -5.19%
25.65  10.40  -59.45%         5.59  5.29  -5.37%
25.61  10.51  -58.96%         5.41  5.21  -3.70%
25.84  10.80  -58.20%         5.48  5.30  -3.28%
25.81  10.85  -57.96%         5.81  5.73  -1.38%
ham mean and sdev for all runs 25.63 10.63 -58.53% 5.58 5.39 -3.41%
spam mean                     spam sdev
83.86  93.17  +11.10%         7.09  4.55  -35.83%
83.64  93.16  +11.38%         6.83  4.52  -33.82%
83.27  92.91  +11.58%         6.81  4.52  -33.63%
83.82  93.14  +11.12%         6.88  4.67  -32.12%
83.89  93.29  +11.21%         6.65  4.56  -31.43%
83.78  93.11  +11.14%         6.96  4.72  -32.18%
83.42  93.00  +11.48%         6.82  4.74  -30.50%
83.86  93.29  +11.24%         6.71  4.55  -32.19%
83.88  93.22  +11.13%         6.98  4.71  -32.52%
83.75  93.28  +11.38%         6.65  4.32  -35.04%
spam mean and sdev for all runs 83.72 93.16 +11.28% 6.84 4.59 -32.89%
ham/spam mean difference: 58.09 82.53 +24.44
So the equidistant score changed from 51.73 at 4.68 sdevs from each mean, to 55.20 at 8.27 sdevs from each. That's big.
The "after" histograms had 200 buckets in this run:
-> <stat> Ham scores for all runs: 20000 items; mean 10.63; sdev 5.39 -> <stat> min 0.281945; median 9.69929; max 81.9673 * = 17 items 0.0 7 * 0.5 13 * 1.0 21 ** 1.5 41 *** 2.0 86 ****** 2.5 166 ********** 3.0 239 *************** 3.5 326 ******************** 4.0 466 **************************** 4.5 554 ********************************* 5.0 642 ************************************** 5.5 701 ****************************************** 6.0 793 *********************************************** 6.5 804 ************************************************ 7.0 933 ******************************************************* 7.5 972 ********************************************************** 8.0 997 *********************************************************** 8.5 934 ******************************************************* 9.0 947 ******************************************************** 9.5 939 ******************************************************** 10.0 839 ************************************************** 10.5 786 *********************************************** 11.0 752 ********************************************* 11.5 760 ********************************************* 12.0 636 ************************************** 12.5 606 ************************************ 13.0 554 ********************************* 13.5 483 ***************************** 14.0 461 **************************** 14.5 399 ************************ 15.0 360 ********************** 15.5 317 ******************* 16.0 275 ***************** 16.5 224 ************** 17.0 193 ************ 17.5 169 ********** 18.0 172 *********** 18.5 154 ********** 19.0 153 ********* 19.5 92 ****** 20.0 104 ******* 20.5 99 ****** 21.0 74 ***** 21.5 73 ***** 22.0 73 ***** 22.5 50 *** 23.0 38 *** 23.5 50 *** 24.0 38 *** 24.5 34 ** 25.0 26 ** 25.5 39 *** 26.0 24 ** 26.5 34 ** 27.0 18 ** 27.5 15 * 28.0 20 ** 28.5 15 * 29.0 14 * 29.5 15 * 30.0 12 * 30.5 15 * 31.0 14 * 31.5 10 * 32.0 12 * 32.5 6 * 33.0 10 * 33.5 4 * 34.0 8 * 34.5 5 * 35.0 5 * 35.5 6 * 
36.0 7 * 36.5 4 * 37.0 2 * 37.5 3 * 38.0 1 * 38.5 4 * 39.0 6 * 39.5 2 * 40.0 2 * 40.5 5 * 41.0 0 41.5 2 * 42.0 3 * 42.5 3 * 43.0 1 * 43.5 2 * 44.0 1 * 44.5 2 * 45.0 1 * 45.5 1 * 46.0 2 * 46.5 0 47.0 3 * 47.5 0 48.0 1 * 48.5 1 * 49.0 1 * 49.5 0 50.0 1 * 50.5 0 51.0 2 * 51.5 0 52.0 1 * 52.5 0 53.0 0 53.5 1 * 54.0 1 * 54.5 2 * 55.0 0 55.5 0 56.0 1 * 56.5 1 * 57.0 0 57.5 0 58.0 0 58.5 1 * 59.0 0 59.5 0 60.0 0 60.5 0 61.0 1 * 61.5 0 62.0 0 62.5 0 63.0 0 63.5 0 64.0 0 64.5 0 65.0 0 65.5 0 66.0 0 66.5 0 67.0 0 67.5 0 68.0 0 68.5 0 69.0 0 69.5 0 70.0 1 * the lady with the long & obnoxious employer-generated sig 70.5 0 71.0 0 71.5 0 72.0 0 72.5 0 73.0 0 73.5 0 74.0 0 74.5 0 75.0 0 75.5 0 76.0 0 76.5 0 77.0 0 77.5 0 78.0 0 78.5 0 79.0 0 79.5 0 80.0 0 80.5 0 81.0 0 81.5 1 * the verbatim quote of a long Nigerian-scam spam ...
-> <stat> Spam scores for all runs: 14000 items; mean 93.16; sdev 4.59 -> <stat> min 24.3497; median 93.8141; max 99.6769 * = 15 items ... 24.0 1 * not really sure -- it's a giant base64-encoded plain text file 24.5 0 25.0 0 25.5 0 26.0 0 26.5 0 27.0 0 27.5 0 28.0 0 28.5 0 29.0 1 * the spam with the uuencoded body we throw away 29.5 0 30.0 0 30.5 0 31.0 0 31.5 0 32.0 0 32.5 0 33.0 0 33.5 0 34.0 0 34.5 0 35.0 0 35.5 0 36.0 0 36.5 0 37.0 0 37.5 0 38.0 0 38.5 0 39.0 0 39.5 0 40.0 0 40.5 0 41.0 0 41.5 0 42.0 0 42.5 0 43.0 0 43.5 0 44.0 0 44.5 0 45.0 0 45.5 0 46.0 1 * Hello, my Name is BlackIntrepid 46.5 0 47.0 0 47.5 0 48.0 0 48.5 0 49.0 0 49.5 0 50.0 0 50.5 0 51.0 0 51.5 0 52.0 0 52.5 0 53.0 0 53.5 1 * unclear; a collection of webmaster links 54.0 1 * Susan makes a propsal (sic) to Tim 54.5 0 55.0 1 * 55.5 0 56.0 0 56.5 1 * 57.0 2 * 57.5 0 58.0 0 58.5 1 * 59.0 0 59.5 0 60.0 1 * 60.5 2 * 61.0 1 * 61.5 1 * 62.0 0 62.5 1 * 63.0 1 * 63.5 0 64.0 1 * 64.5 1 * 65.0 0 65.5 1 * 66.0 1 * 66.5 2 * 67.0 4 * 67.5 2 * 68.0 0 68.5 1 * 69.0 0 69.5 3 * 70.0 1 * 70.5 5 * 71.0 5 * 71.5 3 * 72.0 4 * 72.5 3 * 73.0 3 * 73.5 6 * 74.0 3 * 74.5 4 * 75.0 8 * 75.5 8 * 76.0 10 * 76.5 10 * 77.0 10 * 77.5 17 ** 78.0 14 * 78.5 27 ** 79.0 16 ** 79.5 23 ** 80.0 28 ** 80.5 29 ** 81.0 37 *** 81.5 37 *** 82.0 46 **** 82.5 55 **** 83.0 47 **** 83.5 53 **** 84.0 58 **** 84.5 68 ***** 85.0 86 ****** 85.5 118 ******** 86.0 135 ********* 86.5 159 *********** 87.0 165 *********** 87.5 178 ************ 88.0 209 ************** 88.5 231 **************** 89.0 299 ******************** 89.5 391 *************************** 90.0 425 ***************************** 90.5 402 *************************** 91.0 501 ********************************** 91.5 582 *************************************** 92.0 636 ******************************************* 92.5 667 ********************************************* 93.0 713 ************************************************ 93.5 685 ********************************************** 94.0 610 
***************************************** 94.5 621 ****************************************** 95.0 721 ************************************************* 95.5 735 ************************************************* 96.0 870 ********************************************************** 96.5 742 ************************************************** 97.0 449 ****************************** 97.5 447 ****************************** 98.0 556 ************************************** 98.5 561 ************************************** 99.0 264 ****************** 99.5 171 ************
The mistakes are all familiar; the good news is that "the normal cases" are far removed from what might plausibly be called a middle ground. For example, if we called the region from 40 thru 70 here "the middle ground", and kicked those out for manual review, there would be very few msgs to review, but they would contain almost all the mistakes.
How does this do on your data? I'm in favor of what works <wink>.
[Gary Robinson]
The thing about the geometric mean is that it is much more sensitive to numbers near 0, so the S/(S+H) technique is biased in that way.
A single geometric mean would surely be biased, but the combination of two used here doesn't appear to be. That is, throwing random data at it, the mean and median are 0.5, and it's symmetric around that:

5000 items; mean 0.50; sdev 0.06
-> <stat> min 0.291521; median 0.500264; max 0.726668
* = 24 items
0.25    3 *
0.30   34 **
0.35  211 *********
0.40  816 **********************************
0.45 1431 ************************************************************
0.50 1442 *************************************************************
0.55  809 **********************************
0.60  219 **********
0.65   33 **
0.70    2 *

If I do the same random-data experiment and force a prob of 0.99, the mean rises to 0.52; if I force a prob of 0.01, it falls to 0.48. If there's a bias, it's hiding pretty well <wink>.

If there is a spamprob near 0, it's very much the intent that S take that seriously, and if one near 1, that H take that seriously; else, as now, I see screaming spam or screaming ham barely cracking scores above 70 or below 30. "Too much ends up in the middle."
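[Editor's note: the symmetry claim is easy to check with a throwaway simulation. This sketch re-derives the combiner inline so it runs standalone; the seed and trial count are arbitrary choices, not values from the original experiment.]

```python
import math
import random

def combine(probs):
    # Geometric-mean S and H, score S/(S+H), as described earlier.
    n = len(probs)
    S = math.exp(sum(math.log(p) for p in probs) / n)
    H = math.exp(sum(math.log(1.0 - p) for p in probs) / n)
    return S / (S + H)

random.seed(20021009)
scores = [combine([random.random() for _ in range(50)])
          for _ in range(5000)]
mean = sum(scores) / len(scores)
# Complementing every prob swaps S and H, reflecting the score
# around 0.5 -- so over uniform random input the mean sits at 0.5.
```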
If you want to try something like that, I would suggest using the ARITHMETIC means in computing S and H and again using S(S+H). That would remove that bias.
That doesn't appear promising:

If   S = Smean = (sum p_i)/n
and  H = Hmean = (sum (1-p_i))/n

then Hmean = n/n - Smean = 1 - Smean, and Smean + Hmean = 1. So whether you meant S*(S+H) or S/(S+H), the result is S. To within roundoff error, that's what happens, too.
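The identity is easy to confirm numerically; a quick sketch (the seed and vector length are arbitrary):

```python
import random

# With arithmetic means, H is identically 1 - S, so S/(S+H) collapses
# to S itself (and so does S*(S+H), since S+H == 1).
random.seed(42)
ps = [random.random() for _ in range(50)]
n = len(ps)
S = sum(ps) / n                    # arithmetic mean of the p_i
H = sum(1.0 - p for p in ps) / n   # arithmetic mean of the 1 - p_i
```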
It wouldn't be invoking that optimality theorem, but whatever works...
I'm not sure the optimality theorem in question is relevant to the task at hand, though. Why should we care about rejecting a hypothesis that the word probabilities are uniformly distributed? There's virtually no message in which they are, and no reason to believe that the *majority* of words in spam will have spamprobs over 0.5. Graham got results as good as he did because the spamprob strength of a mere handful of words is usually enough to decide it. In a sense, I am trying to move back toward what worked best in his formulation.
It really seems, as a matter of being educated, that the arithmetic approach is worth trying if it doesn't take a lot of trouble to try it.
Nope, no trouble, but my test data can't demonstrate improvements, just disasters. On a brief 10-fold cv run with 100 ham + 100 spam in each set, using the arithmetic spamprob mean gave results pretty much the same as the default scheme; error rates were the same, but the best range for spam_cutoff shifted from 0.52 thru 0.54, to 0.56 thru 0.58; it increased the spread a little:

ham mean and sdev for all runs
    30.35   30.53   +0.59%
     5.83    5.91   +1.37%

spam mean and sdev for all runs
    80.97   84.08   +3.84%
     7.07    6.38   -9.76%

ham/spam mean difference: 50.62  53.55  +2.93
"but more sensitive to overwhelming amounts of evidence than Gary-combining"
From the email you sent at 1:02PM yesterday:
0.40    0
0.45    2 *
0.50  412 *********
0.55 3068 *************************************************************
0.60 1447 *****************************
0.65   71 **
0.70    0
One thing I'd like to be more clear on. If I understand the experiment correctly you set 10 to .99 and 40 were random.
I have to dig up that email to find the context ... OK, this one was tagged

    Result for random vectors of 50 probs, + 10 forced to 0.99

That means there were 60 probs in all, 50 drawn from (0.0, 1.0), + 10 of 0.99.
What percentage actually ended up as > .5, without regard to HOW MUCH over .5?
From the histogram, all but 2, out of 5000 trials. 0.5 doesn't work as a spam_cutoff on anyone's corpus here, though (it's too low; too many false positives). The median value in that run was 0.58555, which is close to what some people have been using for spam_cutoff.
Under the S/(S+H) scheme, the same experiment yields

5000 items; mean 0.68; sdev 0.05
-> <stat> min 0.490773; median 0.683328; max 0.819528
* = 34 items
0.45    2 *
0.50   27 *
0.55  171 ******
0.60  991 ******************************
0.65 2016 ************************************************************
0.70 1510 *********************************************
0.75  275 *********
0.80    8 *

So if the percentage above 0.5 is the sole measure of goodness here, S/(S+H) did equally well in this experiment.
... It's not the (S-H)/(S+H) that is the most sensitive (under certain conditions); it's that the geometric mean approach for computing S gives a result that is MONOTONIC WITH a calculation which is the most sensitive.
The real technique would take S and feed it into an inverse chi-square function with (in this experiment) 100 degrees of freedom. The output (roughly speaking) would be the probability that that S (or a more extreme one) might have occurred by chance alone.
Call these numbers S' and H' for S and H respectively.
The calculation (S-H)/(S+H) will be > 0 if and only if (S'-H')/(S'+H') > 0 (unless I've made some error).
So, as a binary indicator, the two are equivalent. However, if you used S' and H', you would see something more like real probabilities that would probably be of magnitudes that would be more attractive to you.
You could probably use a table to approximate the inverse chi-square calc rather than actually doing the computations all the time.
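A table-based approximation along those lines might look like the following sketch. The grid spacing, the fixed degrees of freedom, and the function names are all illustrative assumptions, and the exact tail-probability function here is the even-degrees-of-freedom series used elsewhere in this thread:

```python
import math
from bisect import bisect_left

def chi2Q(x2, v):
    """Exact prob(chisq >= x2) for even v degrees of freedom."""
    m = x2 / 2.0
    total = term = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return total

DF = 100                               # fixed degrees of freedom for the table
GRID = [i * 0.5 for i in range(601)]   # x2 sampled from 0.0 to 300.0
TABLE = [chi2Q(x, DF) for x in GRID]

def chi2Q_table(x2):
    """Approximate chi2Q(x2, DF) by linear interpolation into the table."""
    if x2 <= GRID[0]:
        return TABLE[0]
    if x2 >= GRID[-1]:
        return TABLE[-1]
    j = bisect_left(GRID, x2)
    x0, x1 = GRID[j - 1], GRID[j]
    f = (x2 - x0) / (x1 - x0)
    return TABLE[j - 1] * (1.0 - f) + TABLE[j] * f
```

With a 0.5-wide grid the interpolation error is far below anything that would change a classification, though as noted below, speeding this up is premature anyway.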
I didn't suggest doing that, at first, because I was interested in providing a binary indicator and wanting to keep things simple -- and from the POV of a binary indicator, it doesn't make any difference.
It's not a question of attraction <wink> so much as that this "binary indicator" doesn't come with a decision rule for knowing which outcome is which: it varies across corpus, and within a given corpus varies over time, depending on how much data has been trained on. So we get a stream of test results where the numbers have to be fudged retroactively via "but if I had set the cutoff to *this* on this run, the results would have been very different". It's just too delicate as is.
So, if it happens that you feel like taking the time to go "all the way" with this approach, I would suggest actually computing S' and H' and seeing what happens.
Sounds like fun.
I think you would like the results better -- I just didn't suggest it at first because I didn't know the spread would be of such interest and I wanted to keep things simple.
That's fine. In practice, the touchiness of spam_cutoff has been an ongoing practical problem; but it's been the *only* ongoing problem, so that's why we're talking about it <wink>
I think this would work better than the S/(S+H) approach, because if you use geometric means, it's more sensitive to one condition than the other, and if you use arithmetic means, you don't invoke the optimality theorem.
As above, I've found no reason yet to believe S/(S+H) favors one side over the other, and the test runs didn't show me evidence of that either. Indeed, it made the same mistakes on the same messages, but moved mounds of correctly classified message out of "the middle ground".
Of course, this is ALL speculative. But the probabilities involved will DEFINITELY be of greater magnitude, and so give a better-defined spread, if the inverse chi-square is used.
It's doable, but the experimental results so far are promising enough that I'm still keener to see how it works for others here.
If you want to try something like that, I would suggest using the ARITHMETIC means in computing S and H and again using S(S+H). That would remove that bias.
That doesn't appear promising:
If S = Smean = (sum p_i)/n
and H = Hmean = (sum 1-p_i)/n
then Hmean = n/n - Smean = 1 - Smean, and Smean + Hmean = 1. So whether you meant S*(S+H) or S/(S+H), the result is S. To within roundoff error, that's what happens, too.
Ha ha ha! I should have thought of that! :) Gary
If you do decide to try the chi-square thing, the idea is to find or create a function (perhaps using a lookup table) that takes a chi-square random variable and outputs the associated p-value. The input random variable is the product of the p's or (1-p)'s as the case may be. If the p's are uniformly distributed under the null hypothesis, the product is chi-square with 2n degrees of freedom, where n is the number of terms making up the product. So the inverse chi-square function gives the probability associated with that product.

--Gary

--
Gary Robinson
CEO
Transpose, LLC
grobinson@transpose.com
207-942-3463
http://www.emergentmusic.com
http://radio.weblogs.com/0101454
No no no, I had that wrong. Silly of me. Sorry. It's been too long since I did that calc. It's not the product of the p's that is a chi-square distribution, it's the following, given p1, p2, ..., pn:

    2*((ln p1) + (ln p2) + ... + (ln pn))

That expression has a chi-square distribution with 2n degrees of freedom. So you feed THAT into the inverse-chi square function to get a p-value.

Let invchi(x, f), where x is the random variable and f is the degrees of freedom, be the inverse chi square function. Let S be a number near 1 when the email looks spammy and H be a number near 1 when the email looks hammy. Then you want

    S = 1 - invchi(2*((ln (1-p1)) + (ln (1-p2)) + ... + (ln (1-pn))), 2*n)

and

    H = 1 - invchi(2*((ln p1) + (ln p2) + ... + (ln pn)), 2*n)

I am a little out-of-practice but I am about 99.9% sure that the above is right. I looked up some of my notes from a few years ago to get the calc.

--Gary

--
Gary Robinson
CEO
Transpose, LLC
grobinson@transpose.com
207-942-3463
http://www.emergentmusic.com
http://radio.weblogs.com/0101454
[Gary Robinson]
... It's not the product of the p's that is a chi-square distribution, it's the following, given p1, p2..., pn:
2*((ln p1) + (ln p2) + ... + (ln pn))
That expression has a chi-square distribution with 2n degrees of freedom.
I haven't found a reference to this online, and don't have reasonable access to a good technical library, so I need your help to get this straight. The first thing that strikes me is that it can't be quite right <wink>: a chi-squared statistic is positive by its very nature, but the expression is a sum of logs of values in (0.0, 1.0), so is necessarily negative. Here's the chi-squared function I'm using:

"""
import math as _math

def chi2Q(x2, v, exp=_math.exp):
    """Return prob(chisq >= x2, with v degrees of freedom).

    v must be even.
    """
    assert v & 1 == 0
    m = x2 / 2.0
    sum = term = exp(-m)
    for i in range(1, v//2):
        term *= m / i
        sum += term
    return sum
"""

It's an especially simple and numerically stable calculation when v is even, and v always is even if the formulation is right. I understand I could save time with tabulated values, but speeding this is premature. Example:
>>> chi2Q(129.561, 100)
0.025000686582048785
Abramowitz & Stegun give 129.561 as the 0.025 point for the chi-squared distribution with 100 degrees of freedom. Etc. I'm confident *that* function is working correctly.
So you feed THAT into the inverse-chi square function to get a p-value.
I'm feeding its negation in instead, since the correct result would be 1.0 for any negative x2 input.
Let invchi(x, f), where x is the random variable and f is the degrees of freedom, be the inverse chi square function. Let S be a number near 1 when the email looks spammy and H be a number near 1 when the email looks hammy.
Then you want
S = 1 - invchi(2*((ln (1-p1)) + (ln (1-p2)) + ... + (ln (1-pn))), 2*n)
and
H = 1 - invchi(2*((ln p1) + (ln p2) + ... + (ln pn)), 2*n)
OK, I believe I'm doing that, but multiplying the first argument by -2 instead of 2:

"""
from Histogram import Hist
from random import random
import sys

h = Hist(20, lo=0.0, hi=1.0)

def judge(ps, ln=_math.log):
    H = S = 0.0
    for p in ps:
        S += ln(1.0 - p)
        H += ln(p)
    n = len(ps)
    S = 1.0 - chi2Q(-2.0 * S, 2*n)
    H = 1.0 - chi2Q(-2.0 * H, 2*n)
    return S/(S+H)

warp = 0
if len(sys.argv) > 1:
    warp = int(sys.argv[1])

for i in range(5000):
    ps = [random() for j in range(50)]
    p = judge(ps + [0.99] * warp)
    h.add(p)

print "Result for random vectors of 50 probs, +", warp, "forced to 0.99"
h.display()
"""

Note: as usual, scaling (S-H)/(S+H) from [-1, 1] into [0, 1] is

    ((S-H)/(S+H) + 1)/2 = ((S-H+S+H)/(S+H))/2 = (2*S/(S+H))/2 = S/(S+H)

The bad(?) news is that, on random inputs, this is all over the map:

Result for random vectors of 50 probs, + 0 forced to 0.99
5000 items; mean 0.50; sdev 0.27
-> <stat> min 0.000219435; median 0.49027; max 0.999817
* = 6 items
0.00  206 ***********************************
0.05  215 ************************************
0.10  209 ***********************************
0.15  240 ****************************************
0.20  282 ***********************************************
0.25  239 ****************************************
0.30  270 *********************************************
0.35  289 *************************************************
0.40  276 **********************************************
0.45  325 *******************************************************
0.50  291 *************************************************
0.55  300 **************************************************
0.60  278 ***********************************************
0.65  267 *********************************************
0.70  234 ***************************************
0.75  234 ***************************************
0.80  211 ************************************
0.85  207 ***********************************
0.90  213 ************************************
0.95  214 ************************************
I don't think it's uniformly distributed (across many runs, the small peakedness near the midpoint persists), but it's close. The better news is that it's indeed very sensitive to bias (perhaps that's *why* all-random data scores all over the map?):

Result for random vectors of 50 probs, + 1 forced to 0.99
5000 items; mean 0.59; sdev 0.24
-> <stat> min 0.00175781; median 0.596673; max 0.999818
* = 7 items
0.00   49 *******
0.05   94 **************
0.10  119 *****************
0.15  109 ****************
0.20  153 **********************
0.25  185 ***************************
0.30  214 *******************************
0.35  253 *************************************
0.40  286 *****************************************
0.45  338 *************************************************
0.50  344 **************************************************
0.55  381 *******************************************************
0.60  369 *****************************************************
0.65  340 *************************************************
0.70  325 ***********************************************
0.75  315 *********************************************
0.80  292 ******************************************
0.85  285 *****************************************
0.90  255 *************************************
0.95  294 ******************************************

Result for random vectors of 50 probs, + 2 forced to 0.99
5000 items; mean 0.66; sdev 0.21
-> <stat> min 0.0214171; median 0.667916; max 0.9999
* = 7 items
0.00    4 *
0.05   17 ***
0.10   45 *******
0.15   50 ********
0.20   58 *********
0.25  103 ***************
0.30  137 ********************
0.35  181 **************************
0.40  259 *************************************
0.45  324 ***********************************************
0.50  372 ******************************************************
0.55  377 ******************************************************
0.60  412 ***********************************************************
0.65  427 *************************************************************
0.70  345 **************************************************
0.75  369 *****************************************************
0.80  379 *******************************************************
0.85  370 *****************************************************
0.90  376 ******************************************************
0.95  395 *********************************************************

Result for random vectors of 50 probs, + 10 forced to 0.99
5000 items; mean 0.88; sdev 0.13
-> <stat> min 0.494068; median 0.922177; max 1
* = 33 items
0.00    0
0.05    0
0.10    0
0.15    0
0.20    0
0.25    0
0.30    0
0.35    0
0.40    0
0.45    1 *
0.50   84 ***
0.55  139 *****
0.60  172 ******
0.65  257 ********
0.70  282 *********
0.75  326 **********
0.80  369 ************
0.85  560 *****************
0.90  800 *************************
0.95 2010 *************************************************************

Result for random vectors of 50 probs, + 20 forced to 0.99
5000 items; mean 0.97; sdev 0.06
-> <stat> min 0.543929; median 0.996147; max 1
* = 69 items
0.00    0
0.05    0
0.10    0
0.15    0
0.20    0
0.25    0
0.30    0
0.35    0
0.40    0
0.45    0
0.50    1 *
0.55    4 *
0.60   12 *
0.65   25 *
0.70   32 *
0.75   48 *
0.80  128 **
0.85  203 ***
0.90  383 ******
0.95 4164 *************************************************************

I won't bother to show it here, but the histograms are essentially mirror images if I force the bias value to 0.01 instead of to 0.99. Does the near-uniform spread of "scores" on wholly random inputs strike you as sane or insane? I don't understand the theoretical underpinnings of this test, so can't really guess. It did surprise me -- I guess I expected strong clustering at 0.5.
Regardless of whether the chi-squared code makes sense, I whipped up another spamprob() variant to use it, and checked it in. There's a new option:

[Classifier]
use_chi_squared_combining: False

This is yet another alternative to use_tim_combining (by the way, offline Gary and I agreed that tim_combining isn't biased, but are still butting heads over whether it's actually just a trivial transformation of Gary-combining <wink>; scores from each are always on the same *side* of 0.5, but tim-combining scores are always at least as far from 0.5 as Gary-combining scores, and usually significantly farther -- that's why the spread increases so dramatically).

Small test run, 10-fold CV with 400+400 in each set. As usual when switching combining schemes, the "won/lost" things don't make sense for the "after" run, because the appropriate value for spam_cutoff changes. The before run is all-default, the after run just sets the new option true:

-> <stat> tested 400 hams & 400 spams against 3600 hams & 3600 spams
[ditto 19 times]

false positive percentages
    0.000  0.000  tied
    0.000  0.250  lost  +(was 0)
    0.000  0.250  lost  +(was 0)
    0.000  0.000  tied
    0.250  0.500  lost  +100.00%
    0.000  0.250  lost  +(was 0)
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied

won   0 times
tied  6 times
lost  4 times

total unique fp went from 1 to 5  lost +400.00%
mean fp % went from 0.025 to 0.125  lost +400.00%

false negative percentages
    0.000  0.000  tied
    0.250  0.000  won  -100.00%
    0.000  0.000  tied
    0.250  0.250  tied
    0.250  0.000  won  -100.00%
    0.500  0.250  won  -50.00%
    0.000  0.000  tied
    0.250  0.000  won  -100.00%
    0.500  0.250  won  -50.00%
    0.000  0.000  tied

won   5 times
tied  5 times
lost  0 times

total unique fn went from 8 to 3  won -62.50%
mean fn % went from 0.2 to 0.075  won -62.50%

ham mean                       ham sdev
27.29   0.49  -98.20%        5.80   3.68  -36.55%
27.62   0.62  -97.76%        5.57   4.91  -11.85%
27.25   0.66  -97.58%        5.52   5.40   -2.17%
27.75   0.25  -99.10%        5.36   2.39  -55.41%
27.47   0.84  -96.94%        6.07   6.78  +11.70%
27.65   0.78  -97.18%        5.84   4.68  -19.86%
28.00   0.75  -97.32%        5.85   4.41  -24.62%
27.44   0.29  -98.94%        5.35   2.47  -53.83%
27.55   0.36  -98.69%        5.31   2.66  -49.91%
27.95   0.68  -97.57%        5.85   4.37  -25.30%

ham mean and sdev for all runs
27.60   0.57  -97.93%        5.66   4.39  -22.44%

spam mean                      spam sdev
82.89  99.96  +20.59%        7.17   0.48  -93.31%
82.11  99.84  +21.59%        7.04   2.11  -70.03%
81.34  99.93  +22.85%        7.30   0.79  -89.18%
81.73  99.84  +22.16%        7.38   2.66  -63.96%
82.07  99.85  +21.66%        6.78   1.85  -72.71%
82.02  99.70  +21.56%        7.32   3.28  -55.19%
82.03  99.91  +21.80%        7.05   1.27  -81.99%
82.22  99.93  +21.54%        6.75   0.73  -89.19%
82.14  99.70  +21.38%        7.50   3.27  -56.40%
82.30  99.92  +21.41%        7.30   0.84  -88.49%

spam mean and sdev for all runs
82.08  99.86  +21.66%        7.17   2.00  -72.11%

ham/spam mean difference: 54.48  99.29  +44.81

Stare at what happened to the means, and it's easy to see that this is more Graham-like in its score distribution than anything we've seen since using Graham-combining:

-> <stat> Ham scores for all runs: 4000 items; mean 0.57; sdev 4.39
-> <stat> min -2.22045e-013; median 8.33096e-009; max 100

Check out the median there: that's extreme. Note that one ham scored 1.0! That's the Nigerian-scam quote, and I don't care because it's hopeless. It actually scored 0.999999988294.
* = 63 items
 0.0 3813 *************************************************************
 0.5   32 *
 1.0   18 *
 1.5   13 *
 2.0    6 *
 2.5    5 *
 3.0    3 *
 3.5    4 *
 4.0    7 *
 4.5    7 *
 5.0    8 *
 5.5    2 *
 6.0    2 *
 6.5    3 *
 7.0    2 *
 7.5    3 *
 8.0    4 *
 8.5    0
 9.0    4 *
 9.5    2 *
10.0    2 *
10.5    0
11.0    2 *
11.5    1 *
12.0    1 *
12.5    1 *
13.0    2 *
13.5    1 *
14.0    1 *
14.5    1 *
15.0    1 *
15.5    2 *
16.0    1 *
16.5    3 *
17.0    1 *
17.5    1 *
18.0    3 *
18.5    0
19.0    1 *
19.5    0
20.0    1 *
20.5    1 *
21.0    0
21.5    1 *
22.0    0
22.5    0
23.0    1 *
23.5    0
24.0    0
24.5    0
25.0    0
25.5    2 *
26.0    2 *
26.5    0
27.0    1 *
27.5    0
28.0    0
28.5    1 *
29.0    1 *
29.5    2 *
30.0    0
30.5    0
31.0    1 *
31.5    0
32.0    0
32.5    0
33.0    0
33.5    0
34.0    1 *
34.5    0
35.0    0
35.5    0
36.0    1 *
36.5    3 *
37.0    2 *
37.5    0
38.0    0
38.5    0
39.0    0
39.5    2 *
40.0    0
40.5    1 *
41.0    1 *
41.5    0
42.0    0
42.5    0
43.0    0
43.5    0
44.0    0
44.5    1 *
45.0    0
45.5    1 *
46.0    0
46.5    0
47.0    0
47.5    1 *
48.0    0
48.5    0
49.0    2 *
49.5    0
50.0    0
50.5    0
51.0    0
51.5    1 *
52.0    0
52.5    0
53.0    0
53.5    0
54.0    0
54.5    1 *
55.0    1 *  haven't seen this get a high score since using bigrams;
55.5    0    it's someone putting together a Python user group;
56.0    0    "fully functional", etc -- accidental spam phrases
56.5    0
57.0    0
57.5    0
58.0    0
58.5    0
59.0    0
59.5    0
60.0    0
60.5    0
61.0    0
61.5    0
62.0    0
62.5    0
63.0    1 *  "If you are interested in saving money ...": someone looking
63.5    0    to share a hotel room at a Python conference, but neglecting
64.0    0    to mention it *is* a Python conference
64.5    0
65.0    0
65.5    0
66.0    0
66.5    0
67.0    0
67.5    0
68.0    0
68.5    0
69.0    0
69.5    0
70.0    0
70.5    1 *  this is a disturbing fp -- it's not spammish at all;
71.0    0    someone looking for help writing a webmasterish program;
71.5    0    lots of accidental high-spamprob words
72.0    0
72.5    0
73.0    0
73.5    0
74.0    0
74.5    0
75.0    0
75.5    0
76.0    0
76.5    1 *  "TOOLS Europe 2000" conference announcement
77.0    0
...
99.5    1 *  Nigerian-scam quote

-> <stat> Spam scores for all runs: 4000 items; mean 99.86; sdev 2.00
-> <stat> min 46.9565; median 100; max 100
* = 65 items

Note that the *median* is 100: that's extreme.
...
46.5    1 *  "Hello, my Name is BlackIntrepid"
47.0    0
47.5    0
48.0    0
48.5    0
49.0    0
49.5    0
50.0    0
50.5    0
51.0    0
51.5    0
52.0    1 *  "Website Programmers Available Now!"; lots of tech terms
52.5    0
53.0    0
53.5    0
54.0    1 *  This one slays me.  It has this meta tag we ignore:

                 <meta name="keywords" content"free stuff, get paid for
                 being online, make money on the internet, computer jobs,
                 home makers, get paid to surf, mlm, work at home, yes
                 you can. for time spent surfing the internet, everything
                 is free, no obligation, money, FREE, ...

             and on and on.  It also has this tag we ignore:

                 <meta name="Classification" content="free money, mlm,
                 paid to surf, home base business, home base businesses,
                 free money, online">

             It may be the most obvious spam ever created <wink>.
54.5    0
55.0    0
55.5    0
56.0    0
56.5    0
57.0    0
57.5    0
58.0    0
58.5    0
59.0    1 *
59.5    0
60.0    2 *
60.5    0
61.0    0
61.5    0
62.0    0
62.5    0
63.0    0
63.5    0
64.0    0
64.5    0
65.0    0
65.5    0
66.0    0
66.5    0
67.0    0
67.5    0
68.0    0
68.5    0
69.0    1 *
69.5    0
70.0    0
70.5    0
71.0    0
71.5    0
72.0    0
72.5    0
73.0    0
73.5    0
74.0    0
74.5    0
75.0    1 *  If spam_cutoff had been here, it would have matched the
             8 FN from the "before" run, and would have left only the
             Nigerian-scam and TOOLS announcement as f-p.
75.5    0
76.0    0
76.5    0    And if spam_cutoff had been here, the wretched TOOLS
             announcement would have gotten thru too (sorry, but that
             announcement is spam in my eyes)
77.0    1 *
77.5    0
78.0    0
78.5    0
79.0    0
79.5    0
80.0    1 *
80.5    0
81.0    0
81.5    0
82.0    0
82.5    0
83.0    0
83.5    0
84.0    0
84.5    0
85.0    0
85.5    1 *
86.0    0
86.5    1 *
87.0    0
87.5    0
88.0    3 *
88.5    1 *
89.0    2 *
89.5    0
90.0    0
90.5    0
91.0    0
91.5    1 *
92.0    2 *
92.5    0
93.0    3 *
93.5    0
94.0    2 *
94.5    0
95.0    0
95.5    1 *
96.0    1 *
96.5    3 *
97.0    2 *
97.5    1 *
98.0    4 *
98.5    6 *
99.0    3 *
99.5 3953 *************************************************************

Looks promising, albeit uncomfortably extreme. There's a huge and sparsely populated middle ground where all the mistakes live, except for the hopeless Nigerian scam quote.
Example: if we called everything from 50 thru 80 "the middle ground", that easily contains all but the Nigerian mistake, yet contains only 6 (of 4000 total) ham and only 8 (of 4000 total) spam. So in a manual-review system, this combines all the desirable properties:

1. Very little is kicked out for review.

2. There are high error rates among the msgs kicked out for review.

3. There are unmeasurably low error rates among the msgs not kicked out for review.

Feel encouraged to try this if you like, but keep in mind that the *point* here is how useful the middle ground may be -- just pasting in f-p and f-n rates without analysis (== staring at the mistakes and thinking about them) won't help (unless they're both disasters). It may be wise to wait for Gary to look over my previous questions about the math -- I can't swear the implementation even makes sense at this point.
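The manual-review idea above amounts to a three-way decision rule. A minimal sketch, assuming two cutoffs (the function name, the labels, and the 0.50/0.80 defaults are illustrative, echoing the "50 thru 80" middle ground guessed at above):

```python
def classify(score, ham_cutoff=0.50, spam_cutoff=0.80):
    """Three-way decision: confident ham, confident spam, or kick the
    sparse middle ground out for manual review."""
    if score < ham_cutoff:
        return "ham"
    if score > spam_cutoff:
        return "spam"
    return "unsure"
```

The appeal is that with a score distribution this extreme, almost everything falls cleanly on one side or the other, and nearly all the mistakes land in the tiny "unsure" band.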
OK! Gary and I exchanged info offline, and I believe the implementation of use_chi_squared_combining matches his intent for it.
... Example: if we called everything from 50 thru 80 "the middle ground", ... in a manual-review system, this combines all the desirable properties:
1. Very little is kicked out for review.
2. There are high error rates among the msgs kicked out for review.
3. There are unmeasurably low error rates among the msgs not kicked out for review.
On my full 20,000 ham + 14,000 spam test, and with spam_cutoff 0.70, this got 3 FP and 11 FN in a 10-fold CV run, compared to 2 FP and 11 FN under the all-default scheme with the very touchy spam_cutoff. The middle ground is the *interesting* thing, and it's like a laser beam here (yippee!). In the "50 thru 80" range guessed at above,

1. 12 of 20,000 hams lived there, 1 of the FPs among them (scoring 0.737). The other 2 FP scored 0.999999929221 (Nigerian scam quote) and 0.972986477986 (lady with the short question and long obnoxious employer-generated SIG). I don't believe any usable scheme will ever call those ham, though, or put them in a middle ground without greatly bloating the middle ground with correctly classified messages.

2. 14 of 14,000 spams lived there, including 8 (yowza!) of the 11 FN (with 3 scores a bit above 0.5, 1 near 0.56, 1 near 0.58, 1 near 0.61, 1 near 0.63, and 1 near 0.68). The 3 remaining spam scored below 0.50:

   0.35983017036
       "Hello, my Name is BlackIntrepid"  Except that it contained a URL
       and an invitation to visit it, this could have been a poorly
       written c.l.py post explaining a bit about hackers to newbies (and
       if you don't think there are plenty of those in my ham, you don't
       read c.l.py <wink>).

   0.39570232415
       The embarrassing "HOW TO BECOME A MILLIONAIRE IN WEEKS!!" spam,
       whose body consists of a uuencoded text file we throw away
       unlooked at.  (This is quite curable, but I doubt it's worth the
       bother -- at least until spammers take to putting everything in
       uuencoded text files!)

   0.499567195859 (about as close to "middle ground" cutoff as can be)
       A giant (> 20KB) base64-encoded plain text file.  I've never
       bothered to decode this to see what it says; like the others,
       though, it's been a persistent FN under all schemes.
Note that we do decode this; I've always assumed it's of the "long, chatty, just-folks" flavor of tech spam that's hard to catch; the list of clues contains "cookies", "editor", "ms-dos", "backslashes", "guis", "commands", "folder", "dumb", "(well,", "cursor", and "trick" (a spamprob 0.00183748 word!).

For my original purpose of looking at a scheme for c.l.py traffic, this has become the clear leader among all schemes: while it's more extreme than I might like, it made very few errors, and a minuscule middle ground (less than 0.08% of all msgs) contains 64+% of all errors. 3 FN would survive, and 2 FP, but I don't expect that any usable scheme could do better on this data. Note that Graham combining was also very extreme, but had *no* usable middle ground on this data: all mistakes had scores of almost exactly 0.0 or almost exactly 1.0 (and there were more mistakes).

How does it do for you? An analysis like the above is what I'm looking for, although it surely doesn't need to be so detailed. Here's the .ini file I used:

"""
[Classifier]
use_chi_squared_combining: True

[TestDriver]
spam_cutoff: 0.70
nbuckets: 200
best_cutoff_fp_weight: 10
show_false_positives: True
show_false_negatives: True
show_best_discriminators: 50
show_spam_lo = 0.40
show_spam_hi = 0.80
show_ham_lo = 0.40
show_ham_hi = 0.80
show_charlimit: 100000
"""

Your best spam_cutoff may be different, but the point to this exercise isn't to find the best cutoff, it's to think about the middle ground. Note that I set show_{ham,spam}_{lo,hi} to values such that I would see every ham and spam that lived in my presumed middle ground of 0.50-0.80, plus down to 0.40 on the low end. I also set show_charlimit to a large value so that I'd see the full text of each such msg.

Heh: My favorite: Data/Ham/Set7/51781.txt got overall score 0.485+, close to the middle ground cutoff.
It's a msg I posted 2 years ago to the day (12 Sep 2000), and consists almost entirely of a rather long transcript of part of the infamous Chicago Seven trial:

    http://www.law.umkc.edu/faculty/projects/ftrials/Chicago7/chicago7.html

I learned two things from this <wink>:

1. There are so many unique lexical clues when I post a thing, I can get away with posting anything.

2. "tyranny" is a spam clue, but "nazi" a ham clue:

       prob('tyranny') = 0.850877
       prob('nazi') = 0.282714

leaving-lexical-clues-amid-faux-intimations-of-profundity-ly y'rs  - tim
This sounds like it's working out pretty well! If we get to the point that it becomes the accepted technique for spambayes, I'll add it to my online essay.

NOTE: As we've discussed ad nauseam, this multiplicative thing is one-sided in its sensitivity, which is why we end up having to do something like S/(S+H), where S is based on (1-p) calcs for combining the p's and H is based on p calcs.

There ARE meta-analytical ways of combining the p-values which are equally sensitive on both sides... but are a TAD overall less sensitive than the chi-square thing. And frankly, the S/(S+H)-style trick may take away a lot of that super-strength super sensitivity anyway -- maybe even all of the advantage over other methods (I just don't know without directly testing it). So a two-sided combining approach may perform equally well for our practical purposes... there's no way of knowing without trying.

The advantage of such an approach would essentially be algorithmic elegance. No longer would we need that klugy (P-Q)/(P+Q) or S/(S+H) stuff which doesn't convert to a real probability. Instead, the combined P would be all we would need. Combined P near 1 would be spammy, and combined P near 0 would be hammy. And P would be a REAL probability (against the null hypothesis of randomness).

I wouldn't expect any performance ADVANTAGE to this other approach, but it WOULD be more elegant. (Note, all these approaches depend on one or another statistical function, as the current one does on the inverse-chi-square.)

If you are interested in going that way let me know, and I'll send info on how to do it. Maybe you'll have another beautifully simple algorithm up your sleeve to implement the necessary statistical function.

--Gary

--
Gary Robinson
CEO
Transpose, LLC
grobinson@transpose.com
207-942-3463
http://www.emergentmusic.com
http://radio.weblogs.com/0101454
From: Tim Peters <tim.one@comcast.net> Date: Sat, 12 Oct 2002 02:27:29 -0400 To: SpamBayes <spambayes@python.org> Cc: Gary Robinson <grobinson@transpose.com> Subject: RE: [Spambayes] spamprob combining
OK! Gary and I exchanged info offline, and I believe the implementation of use_chi_squared_combining matches his intent for it.
... Example: if we called everything from 50 thru 80 "the middle ground", ... in a manual-review system, this combines all the desirable properties:
1. Very little is kicked out for review.
2. There are high error rates among the msgs kicked out for review.
3. There are unmeasurably low error rates among the msgs not kicked out for review.
On my full 20,000 ham + 14,000 spam test, and with spam_cutoff 0.70, this got 3 FP and 11 FN in a 10-fold CV run, compared to 2 FP and 11 FN under the all-default scheme with the very touchy spam_cutoff. The middle ground is the *interesting* thing, and it's like a laser beam here (yippee!). In the "50 thru 80" range guessed at above,
1. 12 of 20,000 hams lived there, 1 of the FPs among them (scoring 0.737). The other 2 FP scored 0.999999929221 (Nigerian scam quote) and 0.972986477986 (lady with the short question and long obnoxious employer-generated SIG). I don't believe any usable scheme will ever call those ham, though, or put them in a middle ground without greatly bloating the middle ground with correctly classified messages.
2. 14 of 14,000 spams lived there, including 8 (yowza!) of the 11 FN (with 3 scores a bit above 0.5, 1 near 0.56, 1 near 0.58, 1 near 0.61, 1 near 0.63, and 1 near 0.68). The 3 remaining spam scored below 0.50:
0.35983017036 "Hello, my Name is BlackIntrepid" Except that it contained a URL and an invitation to visit it, this could have been a poorly written c.l.py post explaining a bit about hackers to newbies (and if you don't think there are plenty of those in my ham, you don't read c.l.py <wink>).
0.39570232415 The embarrassing "HOW TO BECOME A MILLIONAIRE IN WEEKS!!" spam, whose body consists of a uuencoded text file we throw away unlooked at. (This is quite curable, but I doubt it's worth the bother -- at least until spammers take to putting everything in uuencoded text files!)
0.499567195859 (about as close to "middle ground" cutoff as can be) A giant (> 20KB) base64-encoded plain text file. I've never bothered to decode this to see what it says; like the others, though, it's been a persistent FN under all schemes. Note that we do decode this; I've always assumed it's of the "long, chatty, just-folks" flavor of tech spam that's hard to catch; the list of clues contains "cookies", "editor", "ms-dos", "backslashes", "guis", "commands", "folder", "dumb", "(well,", "cursor", and "trick" (a spamprob 0.00183748 word!).
For my original purpose of looking at a scheme for c.l.py traffic, this has become the clear leader among all schemes: while it's more extreme than I might like, it made very few errors, and a minuscule middle ground (less than 0.08% of all msgs) contains 64+% of all errors. 3 FN would survive, and 2 FP, but I don't expect that any usable scheme could do better on this data. Note that Graham combining was also very extreme, but had *no* usable middle ground on this data: all mistakes had scores of almost exactly 0.0 or almost exactly 1.0 (and there were more mistakes).
How does it do for you? An analysis like the above is what I'm looking for, although it surely doesn't need to be so detailed. Here's the .ini file I used:
""" [Classifier] use_chi_squared_combining: True
[TestDriver] spam_cutoff: 0.70
nbuckets: 200 best_cutoff_fp_weight: 10
show_false_positives: True show_false_negatives: True show_best_discriminators: 50 show_spam_lo = 0.40 show_spam_hi = 0.80 show_ham_lo = 0.40 show_ham_hi = 0.80 show_charlimit: 100000 """
Your best spam_cutoff may be different, but the point to this exercise isn't to find the best cutoff, it's to think about the middle ground. Note that I set
show_{ham,spam}_{lo,hi}
to values such that I would see every ham and spam that lived in my presumed middle ground of 0.50-0.80, plus down to 0.40 on the low end. I also set show_charlimit to a large value so that I'd see the full text of each such msg.
Heh: My favorite: Data/Ham/Set7/51781.txt got overall score 0.485+, close to the middle ground cutoff. It's a msg I posted 2 years ago to the day (12 Sep 2000), and consists almost entirely of a rather long transcript of part of the infamous Chicago Seven trial:
http://www.law.umkc.edu/faculty/projects/ftrials/Chicago7/chicago7.html
I learned two things from this <wink>:
1. There are so many unique lexical clues when I post a thing, I can get away with posting anything.
2. "tyranny" is a spam clue, but "nazi" a ham clue:
prob('tyranny') = 0.850877 prob('nazi') = 0.282714
leaving-lexical-clues-amid-faux-intimations-of-profundity-ly y'rs - tim
Tim Peters wrote:
"Tim combining" simply takes the geometric mean of the spamprobs as a measure of spamminess S, and the geometric mean of 1-spamprob as a measure of hamminess H, then returns S/(S+H) as "the score". This is well-behaved when fed random, uniformly distributed probabilities, but isn't reluctant to let an overwhelming number of extreme clues lead it to an extreme conclusion (although you're not going to see it give Graham-like 1e-30 or 1.0000000000000 scores).
While reading this I had a sudden thought. With the distributions I'm normally interested in, I want to explain the "bulk" accurately, without being extremely sensitive to the tails. E.g., in my previous job, the bulk was a database of protein structures, and I wanted to describe the bulk so that I could recognize the outliers. In my current job, the population is pixel activity on a CCD, and I don't want to be sensitive to bad pixels.

The standard way to calculate a standard deviation is to calculate the mean first, and then calculate sum((x-<x>)^2)/(n-1) in a second pass over the numbers. This is rather sensitive to outliers, however. In both cases I have experience with, the best way to describe the bulk is to use the median, and "median ways" to calculate the standard deviation. These methods absolutely ignore the extreme values.

But now spambayes. The bulk are words like "the" and "with" and "want" and... all totally uninteresting. So if we want to be sensitive to outliers, we should "go the other way". We have two options I can think of:

* use a (x-<x>)^4 function. This will be very sensitive to extremes.
* calculate the mean and standard deviation both using the standard technique and using medians, and then use the DIFFERENCE between the results as a measure of the extreme-characteristic.

Just some random ideas I wouldn't yet know how to apply.

Rob

--
Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/
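[Rob's second option can be sketched with the standard deviation on one hand and a median-based scale estimate (the MAD, rescaled to match the standard deviation for normal data) on the other; their difference grows when extreme values are present. This is only an illustration of the idea, not anything implemented in spambayes:]

```python
import statistics

def classical_std(xs):
    # Two-pass mean-then-sum((x-<x>)^2)/(n-1) estimate; outlier-sensitive.
    return statistics.stdev(xs)

def median_scale(xs):
    """Median absolute deviation, scaled by 1.4826 so it estimates the
    same quantity as stdev for normal data; ignores extreme values."""
    med = statistics.median(xs)
    mad = statistics.median(abs(x - med) for x in xs)
    return 1.4826 * mad

def extremeness(xs):
    """The DIFFERENCE between the outlier-sensitive and outlier-blind
    spread estimates, as a crude measure of "extreme-characteristic"."""
    return classical_std(xs) - median_scale(xs)
```

[On data whose bulk is bland, adding one extreme value barely moves the median-based estimate but inflates the classical one, so the difference spikes.]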