[Spambayes] RE: Further Improvement 2

Tim Peters tim.one@comcast.net
Sat, 21 Sep 2002 17:10:43 -0400


[Gary Robinson]
> I have spent a few hours today thinking about FI2, and I've come to the
> conclusion that the nature of this data is such that we can't so easily
> do what I was trying to do with it. I had an implicit simplifying
> assumption in my thinking, and I now believe it was inappropriate to
> this real-world application.

Fair enough -- we've got plenty to test without it <wink>.

> I DO have in mind a very very rough idea of how we might get there, but
> unfortunately (for me, that is, because I really like thinking
> about this) I don't have time to pursue it further right now. I'll
> probably keep thinking about it in the background, and I should have a
bit more time in late October, so I am personally hopeful that the
> line of thought I have in mind will bear fruit.
>
> BUT for now, I am going to simply delete FI2 from the essay, and hope
> that the ideas that have been explored so far will turn out to be
> useful.  So far, I'm happy that it looks like they may be! :)

We're probably not going to get many test results over the weekend, but it
sure looks promising to me.

> I think playing with the "a" constant in FI1 will be useful, and
> playing with max_discriminators will be useful. So I strongly
> encourage playing with both of them... I know you're more used to
playing with max_discriminators but I think it's important to play
> with "a" too.

Since I got a big win without any effort <wink> by introducing a brand new
"ignore probs that aren't at least this far from neutral" knob, that's the
one I'm most inclined to play with right now.  There isn't a knob in
existence that won't be played with, but especially large tests take
significant wall-clock time to complete, and there's only so much testing
one can do in a day.
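
For concreteness, all that new knob does is throw away weak clues
before the scoring step.  A rough sketch of the idea (the function
name and the 0.1 cutoff here are illustrative, not the real spelling
in the code):

    def strong_clues(probs, min_dist=0.1):
        # Keep only word probabilities at least min_dist away from
        # a neutral 0.5, so weak clues can't dilute the score.
        return [p for p in probs if abs(p - 0.5) >= min_dist]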

Testers, "a" is already exposed via:

[Classifier]
robinson_probability_a: 1.0

I think values nearer to 0 are likely to be the most interesting.
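
For example, to try a value closer to 0, put something like this in
whatever customization file you've pointed the Options machinery at
(the 0.1 is just a starting point for experimentation, not a
recommendation):

[Classifier]
robinson_probability_a: 0.1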

Since nobody likes thrashing in the dark, I'll give you some motivation, but
I hope Gary can take just a little more time to flesh it out:

Graham computes probabilities in a simple and pretty obvious way (btw, this
part is *much* more obvious under Gary's scheme, because there aren't any
MINCOUNT and MAX_SPAMPROB and MIN_SPAMPROB and HAMBIAS and SPAMBIAS "mystery
knobs" complicating it any more).
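
If you haven't read the code, the raw per-word computation under
Gary's scheme boils down to something like this (a sketch, not the
exact classifier code; hamcount and spamcount are the word's training
counts, nham and nspam the total message counts):

    def graham_prob(hamcount, nham, spamcount, nspam):
        # The word's spam rate relative to its combined ham + spam
        # rate -- no clamping, no biases, no minimum counts.
        hamratio = float(hamcount) / nham
        spamratio = float(spamcount) / nspam
        return spamratio / (hamratio + spamratio)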

Gary then does a Bayesian adjustment to that probability.  From 10,000 feet,
we *assume* that the distribution of words comes from a particular
distribution family.  The "a" in the adjustment step is a parameter that
defines a specific member of that family.  The real-world evidence we collect
by counting words is more reliable the larger the corpus and the more often
we see a given word in the corpus.  The adjustment step blends our prior
assumption about word distribution with the actual data we see, giving more
weight to the latter the more actual data about the word we have.
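
In code, the blend looks roughly like this (a sketch based on my
reading of Gary's formula, assuming the implicit prior expectation
for an unseen word is a neutral 0.5 -- Gary can correct me if that's
wrong):

    def adjusted_prob(p, n, a=1.0):
        # p is the raw Graham-style word probability; n is the number
        # of training messages the word appeared in.  With n == 0 this
        # returns exactly 0.5; as n grows, the observed p dominates.
        return (a + n * p) / (2.0 * a + n)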

So this has most effect on words that appear rarely.  When I got rid of
MINCOUNT, a word that appeared exactly once in the entire training set got a
"probability" of 0.01 or 0.99 (depending on which half of the corpus it was
in).  It was just as strong a clue to Graham's scoring scheme as a word that
appeared in every one of the spams and none of the hams.  That's nuts on the
face of it, but even so it made a real improvement.  I later thrashed with
various schemes to reduce this nonsense (that's what the now-gone
adjust_probs_by_evidence_mass option was all about, btw), but they just
weren't helping consistently.
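
To make the rare-word effect concrete, here's the arithmetic for a
word seen exactly once, in the spam half, with a raw probability of
0.99 (the figure from the MINCOUNT story above), using the
adjusted_prob sketch with its 0.5-prior assumption:

    # The adjustment drags a hapax back toward neutral, and a smaller
    # `a` trusts the single observation more:
    adjusted_prob(0.99, n=1, a=1.0)   # (1.0 + 0.99) / 3.0 -> ~0.663
    adjusted_prob(0.99, n=1, a=0.1)   # (0.1 + 0.99) / 1.2 -> ~0.908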

Gary has a much better scheme for doing this.  What I can't tell you is
*which* prior distribution it's assuming, or how changing "a" affects its
shape.  But since we know for sure that getting rid of MINCOUNT made a real
difference before, we also know for sure that the treatment of rare words is
important, and that's what "a" is all about.  This is the point where Gary
unexpectedly discovers he has just enough free time to tell us more about
the specifics of this distribution <wink>.

> I'll stay subscribed to the spambayes list and will contribute
> any thoughts I may happen to have along the way, but I think it's
> now mostly a matter of tuning the parameters against test setups,
> and I don't have a test setup so I can't be of much use there!

Just don't vanish entirely.

> All the best,
>
> And thanks SO much for your willingness to play with my ideas, which I
> REALLY appreciate,

We're going to get sooooo rich from giving your ideas away too, that's all
the thanks we need <wink>.  Thank you!