[Jim Bublitz]
Since my last msg was incomprehensible,
Not at all -- it's just that nobody knew what parts of it meant <wink>.
I'm just going to attach my code at the bottom and refer to it.
Graham's original score calculation -
product/(product + inverseProduct)
does give the kind of score distribution you described.
That's not a matter of speculation, we observed that endlessly over the first few weeks of this project's life, and the msg you're replying to showed that it's true even when fed random data.
If you substitute Gary Robinson's suggestion (see below - last few lines), the score distribution does spread out to the center a little bit.
It spread out enormously for us -- it was night-and-day difference.
You can get Robinson's scoring calculation (as below) to produce a normal distribution around the mean ham or spam score if you either:
a. Increase VECTOR_SIZE (max_discriminators??) - a value of around 100 seems to do pretty well
We use 150 by default. The distributions look "kinda normal", but tighter than normal toward the endpoints, and looser toward the middle.
b. Instead of selecting the most extreme N word probabilities from the msg being tested, select the words randomly from the list of words in the msg (not shown in code below). You immediately (VECTOR_SIZE = 15) get a normal distribution around the means, but accuracy sucks until you select 75 to 100 words/msg randomly.
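Since variant (b) isn't shown in the code below, here's a rough sketch of what I imagine the selection change looks like -- the function name and the sample_size knob are my own, not Jim's:

import random

# Sketch of variant (b): instead of taking the VECTOR_SIZE most
# extreme probabilities, pick words from the msg at random.
# 'pairs' is the same list of (token, prob) tuples built in scan().
def random_significant(pairs, sample_size):
    # random.sample raises if asked for more items than exist,
    # so clamp to the number of candidates
    return random.sample(pairs, min(sample_size, len(pairs)))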
Hmm. I'm not *that* much in love with a normal distribution <wink>.
Neither (a) nor (b) works as well as the 15 most extreme words on my test data. Also, Robinson's calculation doesn't produce ham at 0.99 or spam at 0.01 - in fact the msgs that I had a hard time classifying manually are (mostly) the ones that fall near the cutoff.
Right, we call that "the middle ground" here. It's believed to be useful, although everyone seems to ignore it <wink>.
Note also that the code below will produce an unpredictable score if the msg contains only 30 .01 words and 30 .99 words.
That can't happen for us: I believe you're still using Graham's contrived clamping of probabilities into [.01, .99]. We have no limits; if a word appears only in spam, we give it a raw spamprob of 1; if only in ham, a raw spamprob of 0; and then rely on Gary's probability adjustment to soften those extremes in accord with how *many* messages they've been seen in. One nice result is that we don't suffer "cancellation disease" anymore (a factor often implicated in bad errors when we were using Paul's scheme, where up to hundreds each of .01 and .99 words fought to produce a useless result).
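For anyone who hasn't seen it, here's a minimal sketch of the kind of softening I mean, assuming Robinson's (s*x + n*p)/(s + n) form; the names and the particular knob values are mine, just for illustration:

# Robinson-style softening: blend the raw per-word spamprob toward
# the unknown-word prior, by an amount that shrinks as the word is
# seen in more messages.
UNKNOWN_WORD_PROB = 0.5        # x: what we assume about a never-seen word
UNKNOWN_WORD_STRENGTH = 0.45   # s: how much evidence it takes to move off x

def adjusted_spamprob(raw_prob, nseen):
    s = UNKNOWN_WORD_STRENGTH
    x = UNKNOWN_WORD_PROB
    return (s * x + nseen * raw_prob) / (s + nseen)

# a word seen only once, in spam:  adjusted_spamprob(1.0, 1)  -> ~0.84
# the same word seen in 50 spams:  adjusted_spamprob(1.0, 50) -> ~0.996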
It depends on how pairs.sort (...) handles ties.
Under Python 2.3 (CVS) the sort is stable; under previous Pythons, it's stable "by (reliable) accident" so long as there are no more than 100 elements in the list; if more than that, it's unpredictable.
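If you'd rather not lean on that, one way (not from Jim's code) to make the tie order explicit is to break ties on the token text itself; a small sketch:

# Decorate-sort-undecorate with the token as a tie-breaker, so equal
# distances from 0.5 always come out in the same order, on any Python.
def sort_by_extremity(pairs):
    decorated = [(-abs(prob - 0.5), token, prob) for token, prob in pairs]
    decorated.sort()
    return [(token, prob) for key, token, prob in decorated]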
Making the limits asymmetrical (eg .989 and .01 instead of .99/.01) doesn't seem to work very well.
Dump the limits entirely -- we did, and we're happier <wink>.
The other thing that helps make the scores extreme in actual use is that the distribution of word probabilities is extreme. For my corpora using the code below I get 169378 unique tokens (from 24000 msgs, 50% spam):
    Probability     Number of Tokens    % of Total
    [0.00, 0.01)         46329            27.4%    (never in spam)
    [0.99, 1.00)        104367            61.7%    (never in ham)
                                          -----
                                          89.1%
This gets less problematic if you dump the artificial limits. A word that appears only in spam then gets a higher spamprob the more spams it appears in, and so pushes out other unique-to-one-kind words that haven't appeared as often.
From looking at failures (and assuming passes behave similarly) the 10.9% (~17000 tokens) in between 0.01 and 0.99 still do a lot of the work, which makes sense, since those are the most commonly used words.
We've gotten better results under all schemes by pretending that words with spamprobs in (0.4, 0.6) simply don't exist.
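In case the idea isn't clear, a tiny sketch (not part of Jim's code; MIN_DISTANCE is my name for the knob) of the filter applied before picking the most extreme words:

MIN_DISTANCE = 0.1   # treat probs in (0.4, 0.6) as if the word didn't exist

def interesting_pairs(pairs):
    # keep only (token, prob) pairs far enough from the 0.5 fence
    return [(token, prob) for token, prob in pairs
            if abs(prob - 0.5) >= MIN_DISTANCE]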
My experience has been that the tail tips of the score distribution maintain about the same distance from the mean score no matter what you do. If you improve the shape of the distribution (make it look more normal), you move the tails about the same distance as the distribution has spread out, and the ham and spam tails overlap more and more, increasing the fp/fn rates. The little testing I did on Spambayes (last week's CVS) seemed to show the same effect.
That jibes with my experience too. "The problem" in my data, though, is that some ham simply have nothing hammish about them, unless viewed in the light of semantic knowledge this kind of scheme doesn't have; likewise the only thing spammish about some of the spam is that people here have decided to call them spam <wink>.
For the code below, if I train on 8000 msgs (50% spam) and then test 200, retrain on those 200, and repeat for 16000 msgs, I get 4 fns (3 are identical msgs from the same sender with different dates; all are Klez msgs) and 1 fp (an ISP msg, "Routine Service Maintenance"), which are fn and fp rates of 0.05% and 0.01%. The failures all scored in the range [0.495, 0.511] (cutoff at 0.50).
Whatever you're doing is certainly working well for you!
I ran the SA Corpus
What is this? A SpamAssassin corpus? Nobody has mentioned that here before.
today also and don't get any failures if I train on 8K of my msgs and 50/100 of their msgs (worse results under other conditions), but the sample sizes there are too small to build an adequate training sample and still have enough test data to have confidence in the results. I can post those results if anyone is interested.
Graham's method was basically designed to produce extreme scores, and the distribution of words in the data seems to reinforce that.
If it's of any use to anybody (it's certainly beyond me), both the distribution of msg scores and distribution of word probabilities look like exponential or Weibull distributions. (They're "bathtub" curves, if anyone is familiar with reliability statistics).
This is all based on my data, which is not the same as your data. YMMV.
Jim
# classes posted to c.l.p by Erik Max Francis
# algorithm from Paul Graham ("A Plan for Spam")

import re
# was TOKEN_RE = re.compile(r"[a-zA-Z0-9'$_-]+")
# changed to catch Asian charsets
TOKEN_RE = re.compile(r"[\w'$_-]+", re.U)
FREQUENCY_THRESHHOLD = 1    # was 5
GOOD_BIAS = 2.0
BAD_BIAS = 1.0
# changed to improve distribution 'width' because
# of smaller token count in training data
GOOD_PROB = 0.0001          # was 0.01
BAD_PROB = 0.9999           # was 0.99
VECTOR_SIZE = 15
UNKNOWN_PROB = 0.5          # was 0.4 or 0.2

# remove mixed alphanumerics or strictly numeric:
# eg: HM6116, 555N, 1234 (also Windows98, 133t, h4X0r)
pn1_re = re.compile(r"[a-zA-Z]+[0-9]+")
pn2_re = re.compile(r"[0-9]+[a-zA-Z]+")
num_re = re.compile(r"^[0-9]+")

class Corpus(dict):
    # instantiate one training Corpus for spam, one for ham,
    # and then one Corpus for each test msg as msgs are tested
    # (the msg Corpus instance is destroyed after
    # testing the msg)

    def __init__(self, data=None):
        dict.__init__(self)
        self.count = 0
        if data is not None:
            self.process(data)

    # process is used to extract tokens from msg,
    # either in building the training sample or
    # when testing a msg (can process entire msg
    # or one part of msg at a time)
    # 'data' is a string

    def process(self, data):
        tokens = TOKEN_RE.findall(str(data))
        if not len(tokens):
            return

        # added the first 'if' in the loop to reduce
        # total # of tokens by >75%
        deletes = 0
        for token in tokens:
            if (len(token) > 20) \
               or (pn1_re.search(token) != None) \
               or (pn2_re.search(token) != None) \
               or (num_re.search(token) != None):
                deletes += 1
                continue

            if self.has_key(token):
                self[token] += 1
            else:
                self[token] = 1

        # count tokens, not msgs
        self.count += len(tokens) - deletes

class Database(dict):
    def __init__(self, good, bad):
        dict.__init__(self)
        self.build(good, bad)

    # 'build' constructs the dict of token: probability
    # run once after training from the ham/spam Corpus
    # instances; the ham/spam Corpus instances can be
    # destroyed (after saving?) after 'build' is run

    def build(self, good, bad):
        ngood = good.count
        nbad = bad.count
        # print ngood, nbad, float(nbad)/float(ngood)

        for token in good.keys() + bad.keys():    # doubles up, but works
            if not self.has_key(token):
                g = GOOD_BIAS * good.get(token, 0)
                b = BAD_BIAS * bad.get(token, 0)

                if g + b >= FREQUENCY_THRESHHOLD:
                    # the 'min's are leftovers from counting
                    # msgs instead of tokens for ngood, nbad
                    goodMetric = min(1.0, g/ngood)
                    badMetric = min(1.0, b/nbad)
                    total = goodMetric + badMetric
                    prob = max(GOOD_PROB,
                               min(BAD_PROB, badMetric/total))

                    self[token] = prob

    def scan(self, corpus):
        pairs = [(token, self.get(token, UNKNOWN_PROB))
                 for token in corpus.keys()]

        pairs.sort(lambda x, y: cmp(abs(y[1] - 0.5),
                                    abs(x[1] - 0.5)))
        significant = pairs[:VECTOR_SIZE]

        inverseProduct = product = 1.0
        for token, prob in significant:
            product *= prob
            inverseProduct *= 1.0 - prob

        # Graham scoring - was:
        # return pairs, significant, product/(product + inverseProduct)
        # 'pairs' and 'significant' added to assist data logging, evaluation

        # Robinson scoring - don't know why, but this works great
        n = float(len(significant))    # n could be < VECTOR_SIZE

        # div by zero possible if no headers (and msg has no body)
        try:
            P = 1 - inverseProduct ** (1/n)
            Q = 1 - product ** (1/n)
            S = (1 + (P - Q)/(P + Q))/2
        except:
            S = 0.99

        return pairs, significant, S
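For what it's worth, here's roughly how I'd wire the classes above together. This driver isn't part of Jim's post; the file names and the cutoff handling are made up for illustration:

if __name__ == '__main__':
    # hypothetical training data: one big string of ham, one of spam
    ham = Corpus(open('ham.txt').read())
    spam = Corpus(open('spam.txt').read())
    db = Database(ham, spam)

    # score one unknown msg; the msg Corpus is throwaway, as noted above
    msg = Corpus(open('unknown_msg.txt').read())
    pairs, significant, score = db.scan(msg)
    if score > 0.50:    # Jim's cutoff
        print 'spam', score
    else:
        print 'ham', score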