[Spambayes] Another optimization

Tim Peters tim.one@comcast.net
Wed, 18 Sep 2002 21:11:01 -0400


[T. Alexander Popiel]

[on tokenization schemes]
> ...
> Argh.  I just saw that comment block, nestled between n-grams
> and HTML discussions.  Somehow I managed to overlook it on my
> prior readings.  Sorry.

No problem, Alex -- it can be mind-numbing reading <wink>.

>> Likewise, but under the section "How big should 'a word' be?".

> Yeah, but that section only seems to refer to deciding the cutoff
> length between 12 and 13 characters, not whether there should be a
> cutoff at all.  Hrm.

I think it's safe to say that at least a hundred large experiments have
been run that didn't get written up, or got written up only in checkin
comments that got lost when we moved the codebase from a dark corner of the
Python project.

Since I haven't reviewed this particular decision in a long time, here are
comparative results from changing the upper limit on "word" from 12 to
1200000.  This is on a 10-fold cross validation run, across 20,000 randomly
selected legit comp.lang.python messages, and essentially all of Bruce
Guenter's 2002 spam archive:

"""
-> <stat> tested 2000 hams & 1375 spams against 18000 hams & 12375 spams
-> <stat> tested 2000 hams & 1375 spams against 18000 hams & 12375 spams
-> <stat> tested 2000 hams & 1375 spams against 18000 hams & 12375 spams
-> <stat> tested 2000 hams & 1375 spams against 18000 hams & 12375 spams
-> <stat> tested 2000 hams & 1375 spams against 18000 hams & 12375 spams
-> <stat> tested 2000 hams & 1375 spams against 18000 hams & 12375 spams
-> <stat> tested 2000 hams & 1375 spams against 18000 hams & 12375 spams
-> <stat> tested 2000 hams & 1375 spams against 18000 hams & 12375 spams
-> <stat> tested 2000 hams & 1375 spams against 18000 hams & 12375 spams
-> <stat> tested 2000 hams & 1375 spams against 18000 hams & 12375 spams
-> <stat> tested 2000 hams & 1375 spams against 18000 hams & 12375 spams
-> <stat> tested 2000 hams & 1375 spams against 18000 hams & 12375 spams
-> <stat> tested 2000 hams & 1375 spams against 18000 hams & 12375 spams
-> <stat> tested 2000 hams & 1375 spams against 18000 hams & 12375 spams
-> <stat> tested 2000 hams & 1375 spams against 18000 hams & 12375 spams
-> <stat> tested 2000 hams & 1375 spams against 18000 hams & 12375 spams
-> <stat> tested 2000 hams & 1375 spams against 18000 hams & 12375 spams
-> <stat> tested 2000 hams & 1375 spams against 18000 hams & 12375 spams
-> <stat> tested 2000 hams & 1375 spams against 18000 hams & 12375 spams
-> <stat> tested 2000 hams & 1375 spams against 18000 hams & 12375 spams

false positive percentages
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.050  0.050  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.050  0.050  tied
    0.000  0.000  tied
    0.100  0.100  tied

won   0 times
tied 10 times
lost  0 times

total unique fp went from 4 to 4 tied
mean fp % went from 0.02 to 0.02 tied

false negative percentages
    0.218  0.291  lost   +33.49%
    0.364  0.436  lost   +19.78%
    0.000  0.000  tied
    0.218  0.218  tied
    0.218  0.218  tied
    0.291  0.364  lost   +25.09%
    0.218  0.291  lost   +33.49%
    0.145  0.218  lost   +50.34%
    0.291  0.364  lost   +25.09%
    0.073  0.073  tied

won   0 times
tied  4 times
lost  6 times

total unique fn went from 28 to 34 lost   +21.43%
mean fn % went from 0.203636363636 to 0.247272727273 lost   +21.43%
"""

The null hypothesis (that this change makes no difference) can't be
rejected based on the evidence.  If anything, it *may* hurt the f-n rate a
tiny bit.  But these rates are so low relative to the corpus sizes that,
when looking at the f-n rate, 0.07% is one lousy message out of 1375, so an
increase from (for example) 0.218% to 0.291% *should* be read as "one lousy
message" <wink> instead of as "whoa!  it got 33.5% worse!".  So, in all, I
see the same thing from boosting the upper limit to 1,200,000 as I saw when
boosting it to 13:  no significant difference.  Therefore I'll keep it at
12.
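
For concreteness, the knob in question is just a length cap applied while
tokenizing.  A rough sketch follows -- the names here, and the "skip"
summary token for oversized words, are illustrative, not necessarily what
tokenizer.py actually does:

    import re

    MAX_WORD_SIZE = 12   # the limit under discussion

    def tokenize(text, max_word_size=MAX_WORD_SIZE):
        for word in re.findall(r"\S+", text.lower()):
            if len(word) <= max_word_size:
                yield word
            else:
                # One way to keep a hint about an oversized "word"
                # without bloating the database with one-shot tokens.
                yield "skip: %d" % (len(word) // 10 * 10)

Raising max_word_size from 12 to 1200000 effectively removes the cap, which
is the experiment above.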

>> (Perhaps not so) Strangely enough, my experiments showed that
>> never ignoring a word worked better.

> That's not something I thought to check.  Perhaps I'll try it.

I suspect Gary Robinson has a better approach to this; giving a spamprob of,
e.g., 0.99, to a word that's appeared exactly once in exactly one spam really
hurt me.  But the test results said it was a real win, so I made that change
and left theory to the theorists <wink>.
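
For those without the code in front of them, here's roughly why a one-shot
word ends up at 0.99 under a Graham-style spamprob.  This is a from-memory
sketch with illustrative constants, not a copy of classifier.py:

    def spamprob(spam_count, ham_count, nspam, nham,
                 min_prob=0.01, max_prob=0.99):
        # Graham-style heuristics: ham counts are doubled, ratios capped at 1.
        ham_ratio = min(1.0, 2.0 * ham_count / nham)
        spam_ratio = min(1.0, float(spam_count) / nspam)
        if ham_ratio + spam_ratio == 0.0:
            return 0.5                      # never-seen word: stay neutral
        prob = spam_ratio / (ham_ratio + spam_ratio)
        return min(max_prob, max(min_prob, prob))

    # A word seen exactly once in exactly one spam, and never in ham:
    # spam_ratio > 0 while ham_ratio == 0, so the raw estimate is 1.0
    # and gets clamped to 0.99 no matter how large the corpus is.
    print(spamprob(1, 0, 12375, 18000))     # -> 0.99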

> ...
> Yeah, that's the distinction I thought I saw.  I tested it both ways
> (though not as rigorously as you do) and chose my way.

A problem is that there are many more things that could and should be tested
than I can possibly make time to test.  So, like you, I pick my battles based
on staring at the mistakes our scheme makes.  I suspect Gary Robinson also
has a better approach to dealing with the minprob/maxprob artifacts (Gary has
posted several times in the last two days about cool ideas that beg for
testing -- I suspect I have a better background to make this kind of scheme
really fly than Paul has, and I'm positive Gary has a better background than
I do).

> ...
> The cases that decided me had about 20-30 maximal clues, with about
> 5 or 6 unpaired .99s.  Allowing the less significant clues (ranging
> from .02 to .08 with only a couple .96-.98s) flipped them from 'spam'
> to 'ham', increasing the FN rate annoyingly.

I've only got 28 false negatives left out of 13,750 spam, and the cases
where cancellation still plays a clear role are of the extremely lopsided
nature I mentioned (a hundred .01 vs a few dozen .99).  It's quite possible
that different effects come into play with a smaller corpus, though!  I
haven't had time to test that either.
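
To make the cancellation concrete, here's a bare-bones version of the "most
extreme clues" combining step.  The clue cutoff and the names are
illustrative, and the real scoring has more wrinkles:

    def score(word_probs, max_discriminators=16):
        # Rank clues by distance from the neutral 0.5 and keep the strongest.
        clues = sorted(word_probs, key=lambda p: abs(p - 0.5), reverse=True)
        clues = clues[:max_discriminators]
        prob_spam = prob_ham = 1.0
        for p in clues:
            prob_spam *= p
            prob_ham *= 1.0 - p
        return prob_spam / (prob_spam + prob_ham)

    # Ten maximal ham clues vs. six maximal spam clues: each .01/.99 pair
    # cancels in the ratio, so the four unpaired .01s decide the outcome.
    print(score([0.01] * 10 + [0.99] * 6))   # close to 0.0 -> scored as ham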

>> This "only look at smoking guns" scoring scheme seems to be
>> systematically weak when dealing with long and chatty "just folks"
>> spam (of which there is blessedly little so far <wink>).

> Indeed.  I'm at a loss for how to combat it, though, unless we
> artificially restrict the number of .01s we get through limited
> corpora.

I've got no bright ideas here either.  I've become dubious that a
statistical approach *can* catch these, unless it finds good clues in the
headers.

>> You really need multiple test runs to have confidence you're
>> making progress.

> Yeah, I really ought to be more anal about it. ;-)

Do what timcv.py does <wink>.  I don't want to take time to translate it
into Perl, but the idea is simple:  divide your data into 10 sets at random
(for some value of 10 you like ...).  Then do 10 runs, on each run leaving
one of the 10 sets out to predict on, and training on the other 9/10ths.  If
you give your classifier both "learn" and "unlearn" methods, this can be
done more efficiently than it may first appear (on run #i, unlearn set #i,
predict on set #i, and then train on set #i again).  The results across all
10 runs will then give you good evidence about how sensitive ideas are to
statistical quirks, although all runs are still taken from the same corpus.
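
In code, the loop looks roughly like this -- a sketch only, with a made-up
classifier interface (learn, unlearn, score) rather than timcv.py itself:

    import random

    def cross_validate(messages, classifier, nsets=10):
        # `messages` is a list of (tokens, is_spam) pairs; `classifier` is
        # assumed to have learn(tokens, is_spam), unlearn(tokens, is_spam)
        # and score(tokens) methods.
        messages = list(messages)
        random.shuffle(messages)
        folds = [messages[i::nsets] for i in range(nsets)]

        # Train on everything once up front ...
        for fold in folds:
            for tokens, is_spam in fold:
                classifier.learn(tokens, is_spam)

        results = []
        for fold in folds:
            # ... then on each run: unlearn the held-out set, predict on
            # it, and train on it again before moving to the next run.
            for tokens, is_spam in fold:
                classifier.unlearn(tokens, is_spam)
            results.append([(classifier.score(tokens), is_spam)
                            for tokens, is_spam in fold])
            for tokens, is_spam in fold:
                classifier.learn(tokens, is_spam)
        return results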