[Spambayes] Re: Move closer to Gary's ideal

Gary Robinson grobinson@transpose.com
Sat, 21 Sep 2002 07:05:53 -0400


Tim,

> The math in Graham's combining scheme is such that a prob 0.5 word has no
> effect whatsoever on the outcome.  The math in Gary's combining scheme
> doesn't appear to have the same property:  I believe adding a .5 prob word
> there moves the outcome closer to neutrality.

You are right, including everything does move the outcome closer to
neutrality. BUT for classification purposes, the real question is simply...
is there more evidence for spam, or more for ham?

Including everything lets us actually answer that question in a complete
way.

It does lessen that nice spread we've been looking at, but of course through
ranking or other means we could get a nice spread back.

That is, the spread is still there, but the standard deviation is less, so
on an un-normalized graph everything looks compressed... but theoretically,
anything we could do with the spread before, we can still find a way to do
with the more compressed spread.
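
To make that concrete, here's a toy sketch of the geometric-mean combining
(simplified, and not the exact code Tim checked in): piling on p=0.5 words
squeezes the score toward 0.5, but it doesn't change which side of 0.5 a
message lands on.

"""
from math import prod

def gary_score(probs):
    # P measures spam evidence, Q measures ham evidence; the final score
    # is (1 + (P-Q)/(P+Q)) / 2, so 0.5 means "no idea".
    n = len(probs)
    P = 1.0 - prod(1.0 - p for p in probs) ** (1.0 / n)
    Q = 1.0 - prod(probs) ** (1.0 / n)
    return (1.0 + (P - Q) / (P + Q)) / 2.0

strong = [0.99, 0.97, 0.10]           # a few real clues
padded = strong + [0.5] * 50          # plus a pile of p=0.5 noise

print(gary_score(strong))   # ~0.63 -- clearly on the spam side
print(gary_score(padded))   # ~0.52 -- same side of 0.5, just compressed
"""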

So... if we're doing well at classification this way, we will still be able
to do spread-based things like using sliders for cutoffs, or putting some
emails into a middle ground "you should look at this manually" group.

So I'll be very interested to see how well this does at binary
classification on smaller corpuses (and particularly on corpuses where the
testing and training data are different).


--Gary


-- 
Gary Robinson
CEO
Transpose, LLC
grobinson@transpose.com
207-942-3463
http://www.emergentmusic.com
http://radio.weblogs.com/0101454


> From: Tim Peters <tim.one@comcast.net>
> Date: Fri, 20 Sep 2002 23:29:19 -0400
> To: Gary Robinson <grobinson@transpose.com>, SpamBayes <spambayes@python.org>
> Cc: glouis@dynamicro.on.ca
> Subject: Move closer to Gary's ideal
> 
> I've checked some new code in for the adventurous.  To try it, you can do
> 
> """
> [Classifier]
> use_robinson_probability: True
> use_robinson_combining: True
> max_discriminators: 1500
> [TestDriver]
> spam_cutoff: 0.50
> """
> 
> "1500" was my lazy way of spelling infinity; for now, the code uses
> math.frexp() to simulate unbounded dynamic float range instead of bothering
> with logarithms; this also means the database entries are exactly the same
> as they were before.  I left max_discriminators working because I suspect
> we're going to want it again.
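
For anyone who hasn't seen the frexp() trick, the general idea is roughly
this -- a minimal sketch of the technique, not the checked-in classifier
code:

"""
import math

def long_product(probs):
    # Multiply many small probabilities without underflowing to 0.0 by
    # carrying the running product as mantissa * 2**exponent.
    mant, exp = 1.0, 0
    for p in probs:
        mant *= p
        # frexp returns (m, e) with 0.5 <= m < 1 and m * 2**e == mant,
        # so the integer exponent soaks up the huge dynamic range.
        mant, e = math.frexp(mant)
        exp += e
    return mant, exp   # true product is mant * 2**exp

# A plain float product of these would underflow to 0.0:
print(long_product([1e-5] * 100))   # mantissa in [0.5, 1), exponent near -1660
"""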
> 
> These options have no effect when the above is enabled:
> 
> """
> [Classifier]
> hambias
> spambias
> min_spamprob
> max_spamprob
> """
> 
> I hate all of those, so good riddance if they go <wink>.
> 
> Other options you may want to play with, but I don't recommend it unless
> you've read the source material and think you know what you're doing:
> 
> """
> [Classifier]
> # This one has no effect for now (it's easy to do, I just haven't gotten
> # to it yet).
> use_robinson_ranking: False
> 
> # The "a" parameter in Gary's prob adjustment.
> robinson_probability_a = 1.0
> 
> # Likewise the "x" parameter -- it's like our current UNKNOWN_SPAMPROB.
> robinson_probability_x = 0.5
> """
> 
> I'm still recovering from my corpus screwup and don't have a lot to say
> about this yet.  Overall it seems to be doing as well as the all-default
> scheme (our highly tuned and heavily fiddled Graham scheme)!  If it works
> better than that, I won't be able to tell from my data (the all-default
> scheme was working "too well" for me to demonstrate an improvement if one
> were made).  I did notice it nail some difficult false negatives I don't
> think the minprob/maxprob-hobbled Graham scheme would ever be able to nail.
> So all signs are good so far, except maybe one:
> 
> There's one surprising/maybe-disturbing thing I've seen on all my little
> random-subset runs (which are all I've run so far, interleaved with
> re-cleaning my corpus):  there's not only "a middle ground" now, it's
> essentially ALL "middle ground"!  Scores that aren't due to my corpus
> pollution are virtually all within 20 points of 50.  Here's a typical
> histogram pair from a 10-fold c-v run on a random subset of 1000 ham and
> 1000 spam; the 6 lowest-scoring oddballs in the spam distro were in fact
> bogus false negatives due to my corpus screwup (so picture those dots as
> belonging in the ham histogram instead):
> 
> Ham distribution for all runs:
> * = 6 items
> 0.00   0
> 2.50   0
> 5.00   0
> 7.50   0
> 10.00   0
> 12.50   0
> 15.00   0
> 17.50   0
> 20.00   0
> 22.50   0
> 25.00   0
> 27.50   6 *
> 30.00  35 ******
> 32.50  98 *****************
> 35.00 221 *************************************
> 37.50 307 ****************************************************
> 40.00 229 ***************************************
> 42.50  78 *************
> 45.00  20 ****
> 47.50   5 *
> 50.00   1 *
> 52.50   0
> 55.00   0
> 57.50   0
> 60.00   0
> 62.50   0
> 65.00   0
> 67.50   0
> 70.00   0
> 72.50   0
> 75.00   0
> 77.50   0
> 80.00   0
> 82.50   0
> 85.00   0
> 87.50   0
> 90.00   0
> 92.50   0
> 95.00   0
> 97.50   0
> 
> Spam distribution for all runs:
> * = 6 items
> 0.00   0
> 2.50   0
> 5.00   0
> 7.50   0
> 10.00   0
> 12.50   0
> 15.00   0
> 17.50   0
> 20.00   0
> 22.50   0
> 25.00   0
> 27.50   0
> 30.00   0
> 32.50   1 *
> 35.00   1 *
> 37.50   1 *
> 40.00   1 *
> 42.50   1 *
> 45.00   1 *
> 47.50   1 *
> 50.00  28 *****
> 52.50  64 ***********
> 55.00 184 *******************************
> 57.50 352 ***********************************************************
> 60.00 295 **************************************************
> 62.50  69 ************
> 65.00   1 *
> 67.50   0
> 70.00   0
> 72.50   0
> 75.00   0
> 77.50   0
> 80.00   0
> 82.50   0
> 85.00   0
> 87.50   0
> 90.00   0
> 92.50   0
> 95.00   0
> 97.50   0
> 
> I have to do other things now, but if anyone wants to play with what I
> *would* do if I could <wink>, play with max_discriminators and see whether
> reducing that helps spread this out.  A suspicion is that folding in endless
> quantities of garbage words (spamprob so close to 0.5 that they're not
> really clues at all) may be dragging everything toward 0.5 without giving a
> real benefit.
> 
> The math in Graham's combining scheme is such that a prob 0.5 word has no
> effect whatsoever on the outcome.  The math in Gary's combining scheme
> doesn't appear to have the same property:  I believe adding a .5 prob word
> there moves the outcome closer to neutrality.
>
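
To see why the 0.5 words drop out of Graham-style combining: the spam
product and the ham product are each multiplied by 0.5, so the ratio
between them is untouched. A quick sketch (simplified, without Graham's
clamping and bias knobs):

"""
from math import prod

def graham_score(probs):
    # Graham-style combining: spam product vs. ham product.
    s = prod(probs)
    h = prod(1.0 - p for p in probs)
    return s / (s + h)

clues = [0.99, 0.97, 0.10]
print(graham_score(clues))                # ~0.997
print(graham_score(clues + [0.5] * 50))   # identical: each 0.5 scales s and h equally
"""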