[Spambayes] Moving closer to Gary's ideal

Tim Peters tim.one@comcast.net
Sat, 21 Sep 2002 15:23:16 -0400


[Gary Robinson]
> Keeping the '5's is just being REALLY purist. If everything else is
> right, it shouldn't hurt.  It just falls out of really trying to be
> pure so that we can do this in a mathematical way instead of a way of
> just constantly finding little patches to little problems. In my
> experience being "relentlessly" mathematical, as opposed to just
> continually finding little "kludgy" solutions to each little aspect,
> generally pays off... that path forces you to really get the basics
> EXACTLY right, and then various theorems can be realistically invoked
> to maximize performance.

I couldn't agree more, Gary.  I've ranted about the biases and "tricks" in
Paul's original formulation from the start, and have worked like hell to
eliminate them.  Every step of the way, results got demonstrably better by
doing so.  But I hit a wall when trying to out-maneuver the combination of
hambias + 2-digit prob bounds + low max_discriminators:  changing any one of
those demonstrably lost, while changing any two at a time produced mixed
results.  You've got a wonderfully clean way to nuke all 3 at once, my data
isn't arguing with it, and-- clean or not --I listen to my data most closely
of all <wink>.
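
For the archives, those three are:  hambias double-counts ham words when
computing p(w), the "2-digit prob bounds" clamp every p(w) into
[0.01, 0.99], and a low max_discriminators means only a handful (15 in
Paul's essay) of the most extreme words get scored at all.  Here's a
from-memory sketch of how the first two fit together -- names and defaults
are illustrative, not our option names:

    def graham_spamprob(ham_count, spam_count, nham, nspam,
                        hambias=2.0, min_prob=0.01, max_prob=0.99):
        """Rough sketch of Paul Graham's word probability.

        hambias double-counts ham occurrences, and the result is clamped
        to two decimal digits' worth of bounds, [0.01, 0.99].
        """
        g = min(1.0, hambias * ham_count / float(nham))  # biased ham freq
        b = min(1.0, spam_count / float(nspam))          # spam frequency
        if g + b == 0.0:
            return 0.5      # word never seen; call it neutral here
        return max(min_prob, min(max_prob, b / (g + b)))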

> OTOH sometimes the data is just so messy to begin with that it just
> isn't possible to get there.

It's worse than that here, as this isn't passive data:  spam can change to
exploit quirks.  Simple example:  I change my spam to look like this:

    multipart/alternative
        text/html
            Here's my spam.
        text/plain
            Thousands of words' worth of utterly bland text.

The user sees the rendered HTML, but the scorer also sees the bland text,
which, if it's bland enough and there's enough of it, can pull the score
down as close to 0.5 as they care to make it.  Even simpler:

    Here's my spam.

    Here's thousands of words' worth of utterly bland text.

Some spam *already* looks like that, and in some of it the bland text is
randomly generated.  Spammers don't care:  they have to get your attention
at the very start; if they do, it doesn't matter what's at the end of the
message.  The combination of "smoking guns only" and "just a handful of
them" in Graham's formulation, plus the fact that 0.5-prob words make no
difference at all to the outcome (indeed, *any* pair of words whose probs
sum to 1.0 cancels out), seems pretty robust against this kind of attack
(putting "good" smoking guns in the gibberish at the end would require
knowing a great deal about the targets the spam is sent to, and customizing
it for them -- that's expensive, and if spammers had the budget for that,
they wouldn't be bottom-feeding spammers <wink>).
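
To see why the padding buys so little against that combination:  a
0.5-prob word contributes the same factor to the spam product as to the
ham product, so it divides right back out, and so does *any* pair of
words whose probs sum to 1.0.  A toy demonstration with the bare
combining formula (not our scorer -- and remember the real scheme never
even hands the bland words to the combiner, it only scores the handful
of most extreme ones):

    def graham_combine(probs):
        """Combine individual word probs into P(spam), Graham-style."""
        prod_p = prod_q = 1.0
        for p in probs:
            prod_p *= p
            prod_q *= 1.0 - p
        return prod_p / (prod_p + prod_q)

    guns = [0.99, 0.98, 0.97]            # a few smoking guns
    padding = [0.5] * 50 + [0.3, 0.7]    # neutral filler + cancelling pair
    assert abs(graham_combine(guns) - graham_combine(guns + padding)) < 1e-12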

> IN ANY CASE, I think it would be pragmatic at this point to have more
> experiments with small corpuses with FI1 (f(w))

Note to testers:  that's exactly what the currently checked-in code does, if
you use the suggested options.  I reviewed the code today after getting some
sleep, and found no more bugs in it -- it's computing what I intended it to
compute, which has a non-zero chance of also being what Gary wanted us to
try <wink>.

> and some kind of cutoff so that only extreme values are processed...
> experimentally, you have accrued a lot of evidence that, that does help
> under the current overall processing.

I earlier posted results from simply ignoring words with spamprob (f(w)) in
(0.4, 0.6), and that appeared to be a significant win for the f-n rate on my
large test set.  That's damned hard to achieve.  If there's a more
principled way to get there, I'm all ears, but my intuition remains that if
we try to draw subtle distinctions here, we're trying to solve the wrong
problem (e.g., if we were trying to decide whether a thing was written by
Shakespeare or Bacon, I'd be loath to throw out the tiniest clue -- but
spam isn't subtle at all, and I think there's a danger of missing the bear's
jaws while staring at its feces with a magnifying glass <wink>).
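
For concreteness, that experiment was nothing fancier than a filter like
this applied before scoring (function and variable names are just for
illustration):

    def drop_middling(word_probs, band=0.1):
        """Ignore (word, prob) pairs whose prob sits within `band` of a
        dead-neutral 0.5; band=0.1 throws away everything in (0.4, 0.6)."""
        return [(word, prob) for word, prob in word_probs
                if abs(prob - 0.5) >= band]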

> f(w) doesn't rely on esoteric theorems to be useful, as g(w)
> does. f(w) just MAKES SENSE.

It makes a lot of sense even from my ignorant viewpoint <wink>.  It's a
beautiful adjustment!
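
For anyone reading along without Gary's note in front of them, the
adjustment (as I read it) blends the raw word probability with an assumed
prior, so a word seen only once or twice gets pulled toward the prior
instead of swinging to an extreme.  Parameter names and defaults here are
just my illustration:

    def robinson_f(pw, n, a=1.0, x=0.5):
        """Gary's adjusted word probability, as I read it.

        pw -- the raw word probability
        n  -- number of training messages the word appeared in
        a  -- strength of the prior assumption (the tunable "a")
        x  -- assumed probability for a word we have no data on
        """
        return (a * x + n * pw) / (a + n)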

> It should work even if everything else isn't totally pure.
>
> I would run MAX_DISCRIMINATOR testing with respect to f(w), not
> (p(w)), cuz f(w) is more meaningful. Tim has tried both the 150 most
> extreme

Note that all my results from last night and today were run with that
boosted to 1,500 (it was capped at 150 before only because I hadn't yet
written code to prevent floating-point underflow -- there's no limit now on
how high max_discriminators can be set).
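
The underflow problem is just that a product of 1,500 numbers less than
1.0 can shrink past what a float can hold.  The usual dodge is to sum
logs instead of multiplying, rebuilding the ratio only at the end; here's
a sketch of that idea, not a description of what the checked-in code
actually does:

    from math import exp, log

    def combine(probs):
        """Combine many word probs (each strictly in (0, 1)) into a spam
        score without underflowing the products."""
        log_p = log_q = 0.0
        for p in probs:
            log_p += log(p)
            log_q += log(1.0 - p)
        diff = log_q - log_p
        if diff > 700.0:    # exp() would overflow; score is essentially 0
            return 0.0
        # p/(p+q) == 1/(1 + q/p) == 1/(1 + exp(log_q - log_p))
        return 1.0 / (1.0 + exp(diff))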

> and the 15 most extreme...

I haven't tried that at all with f(w) yet.

> it would be interesting to see what happened with both of those
> measures using abs(f(w)) as the indicator of extremeness -- maybe
> that's what Tim is currently doing, I'm not sure now.

I'm unclear on what you mean by abs(f(w)).  f(w) is always > 0 in my
implementation; I don't see how it could become negative unless a negative
"a" or "x" parameter were fed into to it (which wouldn't seem to make any
sense).  The 2-line patch I posted earlier used

    abs(f(w) - 0.5)

as the measure of extremeness, in exactly the same way we measured "smoking
gun-ness" for Graham's p(w).

> ==>I would also experiment with the size of a. When a gets small
> (compared to 1) it will make a very different choice of words end up at
> the extremes.  a is really something that could meaningfully be tuned
> for performance.<==

If I had any performance left to gain, I'd try that <wink>.

> ---
> For those of you who may be interested in my "purist" perspective (others
> can stop reading now! ;)   ):
>
> I want to be able to invoke a particular theorem via g(w). I have done so
> before very effectively, but it may not be possible here.
>
> I can give ideas to test to bring things further in that direction, but
> it all depends on how far you guys want to go.
>
> I would love to be able to offload experimental work with regard to that
> idea from you guys and do it myself, but I can't now. I'm CEO of a
> company that is in a critical period and just have no free time. I may
> have a bit more free time beginning mid-october, but I'm not sure.

I think that, while it could be better, we've got very usable software in
place to automate almost all of a solid testing methodology.  It's quite
easy to run a test now, although there are several little pieces you have to
learn to string together in the right order, so there's definitely a
learning curve at the start.  What remains difficult for most people is
getting a good corpus of test data, and cleaning it, and classifying it,
and-- if it's a mixed source corpus --painfully learning which of the header
lines you dare not tokenize lest you get superb results for bogus reasons.

But several people here have endured that to the end, so I think we're up to
the challenge.  To test g(w), though, I first have to implement it <wink>.

> From the purist perspective, I think it would be good to look at
> the case of these conferences that are now being classified as spam,
> and see if we can see some very fundamental reason why they are now
> coming out as spams.

They're not, if I boost the cutoff to 0.575.  The "fundamental reason" is
really quite obvious:  they're trying to *sell* you on going to a
conference, and therefore use the language of advertising.  They pretty much
have to.  Tokenizing via word bigrams makes this screamingly obvious:  they
use word pairs that almost never show up in non-sales-oriented email (see
our project's TESTING.txt for many examples).  "You" and "receive" are bland
on their own, but "you receive" is a very spammish bigram.
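
By "word bigrams" I mean nothing deeper than pairing each word with its
successor, so the scorer sees "you receive" as a token in its own right.
A sketch (not our tokenizer):

    def word_bigrams(text):
        """Return adjacent word pairs as single tokens, e.g.
        "when you receive this" -> ["when you", "you receive",
        "receive this"]."""
        words = text.lower().split()
        return [words[i] + " " + words[i + 1]
                for i in range(len(words) - 1)]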

It's an open question to me too whether or not they *should* be called ham!
Many of the conference announcements posted to comp.lang.python have no
particular relevance to that newsgroup (the ones that do are never called
spam -- they get strong brownie points for mentioning Python!), they are
unsolicited automated email, and they are trying to sell you something.

> I have some ideas which I will think about...
>
> For instance, it is not really right that the original p's are computed
> by counting ALL the occurrences of the words, even multiple occurrences
> in the same email, but then we are NOT doing classification with that
> data but rather just looking at presence vs. absence of a word.

Yes, that's a remaining asymmetry that bugs me too.  Two things I know from
earlier experiments on our heavily tweaked Graham scheme:

1. It hurt to *score* a word multiple times in a msg.  The "handful of
   smoking guns" scoring meant that one unlucky word repeated multiple
   times could kill the whole message.  A classic example was the
   poor guy whose perfectly reasonable post was flagged as spam
   because he happened to include some Unix ls output, and his Unix
   name in the output-- repeated on every output line --just happened to
   have a very high spamprob.  Graham's scoring scheme latched on to
   that, and didn't look at anything else.

2. It also hurt to compute the spamprob *not* counting repetitions
   multiple times, but this was a milder effect and seemingly much
   subtler.  I may mention my penis offhandedly before this email
   is done, but the last spam I got mentioned somebody else's penis
   a dozen times.  Massive repetition is part of the "language of
   advertising" too.

Those experiments need to be redone now with f(w) and high
max_discriminators.
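
Restating Gary's point as code, the two counting modes differ only in
whether training deduplicates a message's tokens first.  A sketch of the
training-side tally only (names are made up):

    def train_counts(messages, once_per_message=True):
        """Tally word counts across a list of tokenized messages.

        once_per_message=True counts each word at most once per message
        (presence/absence, matching how scoring uses the data); False
        counts every occurrence (how the original p's are computed).
        """
        counts = {}
        for tokens in messages:
            if once_per_message:
                tokens = set(tokens)
            for token in tokens:
                counts[token] = counts.get(token, 0) + 1
        return counts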

> Until we are very consistent about all that stuff, we may not have an
> underlying distribution that enables theorems that rely on particular
> distributions to be invoked, because everything just may be too off-
> kilter.

Understood and appreciated.

> Also there may be other problems that have to be addressed in order to
> really have a known distribution under a null hypothesis. I thought
> of some possible ones last night and will continue thinking about them.

Great!

> Some of these problems may be intrinsic to the data we're dealing
> with -- there may be no way to get around them or we may have to do
> something clever to get around them.

So far "stupid" has beat "clever" on almost every experiment, where I'm
careful not to confuse clever with principled.

> When I get clearer on the problems and possible solutions, I'll
> say so... or if there's anyone here who wants to work with me on that
> perspective, that would be great... just let me know.

Discussion is always welcome here.

> But in the meantime, I think you guys should play with f(w) and tune
> a and MAX_DISCRIMINATOR and see if you can optimise things for better
> performance than you had before, and stay away from further work
> on g(w). That's pragmatic.

Because g(w) is easy to implement, I'm going to try it.  Our source code
here is open for all to stare at, and that gives me much more confidence
that we're not suffering from "stupid bugs" when we try something -- and
it's very easy to make a mistake.  I won't believe that g(w) hurts or helps
until my tests tell me (no offense to GregL intended -- I simply have no
idea what his code does).