[Spambayes] spamprob combining

Sat Apr 26 11:09:28 EDT 2003

[Jim Bublitz]
> Since my last msg was incomprehensible,

Not at all -- it's just that nobody knew what parts of it meant <wink>.

< I'm just going to attach my code at the bottom and refer to it.
>
> Graham's original score calculation -
>
>     product/(product + inverseProducct)
>
> does give the kind of score distribution you described.

That's not a matter of speculation, we observed that endlessly over the
first few weeks of this project's life, and the msg you're replying to
showed that it's true even when fed random data.

> If you substitute Gary Robinson's suggestion (see below - last few
> lines), the score distribution does spread out to the center a little
> bit.

It spread out enormously for us -- it was night-and-day difference.

> You can get Robinson's scoring calculation (as below) to produce a
> normal distribution around the mean ham or spam score if you
> either:
>
> a. Increase VECTOR_SIZE (max_discriminators??) - a value
> of around 100 seems to do pretty well

We use 150 by default.  The distributions look "kinda normal", but tighter
than normal toward the endpoints, and looser toward the middle.

> b. Instead of selecting the most extreme N word probabilities
> from the msg being tested, select the words randomly from the list
> of words in the msg (not shown in code below). You immediately
> (VECTOR_SIZE = 15) get a normal distribution around the means, but
> accuracy sucks until you select 75 to 100 words/msg randomly.

Hmm.  I'm not *that* much in love with a normal distribution <wink>.

> Neither (a) nor (b) works as well as the 15 most extreme words on
> my test data. Also, Robinson's calculation doesn't produce ham at
> 0.99 or spam at 0.01 - in fact the msgs that I had a hard time
> classifying manually are (mostly) the ones that fall near the
> cutoff.

Right, we call that "the middle ground" here.  It's believed to be useful,
although everyone seems to ignore it <winK>.

> Note also that the code below will produce an unpredictable score
> if the msg contains only 30 .01 words and 30 .99 words.

That can't happen for us:  I believe you're still using Graham's contrived
clamping of probabilities into [.01, .99].  We have no limits; if a word
appears only in spam, we give it a raw spamprob of 1; if only in ham, a raw
spamprob of 0; and then rely on Gary's probability adjustment to soften
those extremes in accord with how *many* messages they've been seen in.  One
nice result is that we don't suffer "cancellation disease" anymore (a factor
often implicated in bad errors when we were using Paul's scheme, where up to
hundreds each of .01 and .99 words fought to produce a useless result).

> It depends on how pairs.sort (...) handles ties.

Under Python 2.3 (CVS) the sort is stable; under previous Pythons, it's
stable "by (reliable) accident" so long as there are no more than 100
elements in the list; if more than that, it's unpredictable.

> Making the limits asymmetrical (eg .989 and .01 instead of .99/.01)
> doesn't seem to work very well.

Dump the limits entirely -- we did, and we're happier <wink>.

> The other thing that helps make the scores extreme in actual use is
> that the distribution of word probabilities is extreme. For my
> corpora using the code below I get 169378 unique tokens (from 24000
> msgs, 50% spam):
>
> Probability    Number of Tokens  % of Total
> [0.00, 0.01)       46329           27.4%  (never in spam)
> [0.99, 1.00)      104367           61.7%  (never in ham)
>                                    -----
>                                    89.1%

This gets less problematic if you dump the artificial limits.  A word that
appears only in spam gets a higher spamprob the more spams it appears in
then, and so pushes out other unique-to-one-kind words that haven't appeared
as often.

> From looking at failures (and assuming passes behave similarly) the
> 10.9% (~17000 tokens) in between 0.01 and 0.99 still do a lot of the
> work, which makes sense, since those are the most commonly used
> words.

We've gotten better results under all schemes by pretending that words with
spamprobs in (0.4, 0.6) simply don't exist.

> My experience has been that the tail tips of the score distribution
> maintain about the same distance from the mean score no matter what
> you do. If you improve the shape of the distribution (make it look
> more normal), you move the tails about the same distance as the
> distribution has spread out, and the ham and spam tails overlap
> more and more, increasing the fp/fn rates. The little testing I did
> on Spambayes (last week's CVS) seemed to show the same effect.

That jibes with my experience too.  "The problem" in my data, though, is
that some ham simply have nothing hammish about them, unless viewed in the
light of semantic knowledge this kind of scheme doesn't have; likewise the
only thing spammish about some of the spam is that people here have decided
to call them spam <wink>.

> For the code below, if I train on 8000 msgs (50% spam) and then
> test 200, retrain on those 200, and repeat for 16000 msgs, I get 4
> fns (3 are identical msgs from the same sender with different dates,
> all are Klez msgs) and 1 fp (an ISP msg "Routine Service
> Maintenance"), which are fn and fp rates of 0.05% and 0.01%. The
> failures all scored in the range [0.495, 0.511] (cuttoff at 0.50)

Whatever you're doing is certainly working well for you!

> I ran the the SA Corpus

What is this?  A SpamAssassin corpus?  Nobody has mentioned that here
before.

> today also and don't get any failures if I train on 8K of my msgs and
> 50/100 of their msgs (worse results under other conditions), but the
> sample sizes there are too small to do an adequete training sample and
> have enough test data to have confidence in the results. I can post
? those results if anyone is interested.
>
> Graham's method was basically designed to produce extreme scores,
> and the distribution of words in the data seems to reinforce that.
>
> If it's of any use to anybody (it's certainly beyond me), both the
> distribution of msg scores and distribution of word probabilities
> look like exponential or Weibull distributions. (They're "bathtub"
> curves, if anyone is familiar with reliability statistics).

>
> This is all based on my data, which is not the same as your data.
> YMMV.
>
> Jim
>
>
> # classes posted to c.l.p by Erik Max Francis
> # algorithm from Paul Graham ("A Plan for Spam")
>
> # was TOKEN_RE = re.compile(r"[a-zA-Z0-9'$_-]+")
> # changed to catch Asian charsets
> TOKEN_RE = re.compile(r"[\w'$_-]+", re.U)
> FREQUENCY_THRESHHOLD = 1 # was 5
> GOOD_BIAS = 2.0
> BAD_BIAS = 1.0
> # changed to improve distribution 'width' because
> # of smaller token count in training data
> GOOD_PROB = 0.0001 # was 0.01
> BAD_PROB = 0.9999  # was 0.99
> VECTOR_SIZE = 15
> UNKNOWN_PROB = 0.5 # was 0.4 or 0.2
>
> # remove mixed alphanumerics or strictly numeric:
> #  eg: HM6116, 555N, 1234 (also Windows98, 133t, h4X0r)
> pn1_re = re.compile (r"[a-zA-Z]+[0-9]+")
> pn2_re = re.compile (r"[0-9]+[a-zA-Z]+")
> num_re = re.compile (r"^[0-9]+")
>
>
> class Corpus(dict):
>     # instantiate one training Corpus for spam, one for ham,
>     # and then one Corpus for each test msg as msgs are tested
>     # (the msg Corpus instance is destroyed after
>     # testing the msg)
>
>     def __init__(self, data=None):
>         dict.__init__(self)
>         self.count = 0
>         if data is not None:
>             self.process(data)
>
>     # process is used to extract tokens from msg,
>     # either in building the training sample or
>     # when testing a msg (can process entire msg
>     # or one part of msg at a time)
>     # 'data' is a string
>
>     def process(self, data):
>         tokens = TOKEN_RE.findall(str (data))
>         if not len (tokens): return
>
>         # added the first 'if' in the loop to reduce
>         # total # of tokens by >75%
>         deletes = 0
>         for token in tokens:
>             if (len (token) > 20)\
>                 or (pn1_re.search (token) != None)\
>                 or (pn2_re.search (token) != None)\
>                 or (num_re.search (token) != None):
>                 deletes += 1
>                 continue
>
>             if self.has_key(token):
>                 self[token] += 1
>             else:
>                 self[token] = 1
>
>         # count tokens, not msgs
>         self.count += len (tokens) - deletes
>
>
> class Database(dict):
>     def __init__(self, good, bad):
>         dict.__init__(self)
>         self.build(good, bad)
>
>     # 'build' constructs the dict of token: probability
>     # run once after training from the ham/spam Corpus
>     # instances; the ham/spam Corpus instances can be
>     # destroyed (after saving?) after 'build' is run
>
>     def build(self, good, bad):
>         ngood = good.count
>         nbad = bad.count
> #        print ngood, nbad, float(nbad)/float(ngood)
>
>         for token in good.keys() + bad.keys(): # doubles up, but
>                                                # works
>             if not self.has_key(token):
>                 g = GOOD_BIAS*good.get(token, 0)
>                 b = BAD_BIAS*bad.get(token, 0)
>
>                 if g + b >= FREQUENCY_THRESHHOLD:
>                     # the 'min's are leftovers from counting
>                     # msgs instead of tokens for ngood, nbad
>                     goodMetric = min(1.0, g/ngood)
>                     badMetric = min(1.0, b/nbad)
>                     total = goodMetric + badMetric
>                     prob = max(GOOD_PROB,\
>                         min(BAD_PROB,badMetric/total))
>
>                     self[token] = prob
>
>     def scan(self, corpus):
>         pairs = [(token, self.get(token, UNKNOWN_PROB)) \
>             for token in corpus.keys()]
>
>         pairs.sort(lambda x, y: cmp(abs(y[1] - 0.5), abs(x[1]\
>                          - 0.5)))
>         significant = pairs[:VECTOR_SIZE]
>
>         inverseProduct = product = 1.0
>         for token, prob in significant:
>             product *= prob
>             inverseProduct *= 1.0 - prob
>
> # Graham scoring - was:
> #        return pairs, significant, product/(product +\
> #                                inverseProduct)
> # 'pairs' and 'significant' added to assist data logging, evaluation
>
> # Robinson scoring - don't know why, but this works great
>
>         n = float (len (significant)) # n could be < VECTOR_SIZE
>
>         # div by zero possible if no headers (and msg has no body)
>         try:
>             P = 1 - inverseProduct ** (1/n)
>             Q = 1 - product ** (1/n)
>             S = (1 + (P - Q)/(P + Q))/2
>         except:
>             S = 0.99
>
>         return pairs, significant, S
>
>