[Spambayes] How low can you go?

Tim Peters tim.one at comcast.net
Thu Dec 18 22:31:49 EST 2003


[Tim]
>> BTW, it should *not* be necessary to increase
>> max_discriminators, and doing so can create subtle numeric
>> problems in the inverse chi-squared function.
>> Without this option, in an N-token message, N tokens were
>> candidates for scoring; with this option, there are still
>> exactly N candidates for scoring; with a true tiling
>> implementation, there are no more than N
>> candidates for scoring (and usually less than N).

[Tony Meyer]
> So the comment in here:
>
> <http://mail.python.org/pipermail/spambayes-dev/2003-September/001005.html>
> is only referring to cases where both unigrams *and* bigrams are used,
> rather than when the tiling (or crude approximation) is used?

The first quoted comment there is the important one:  "I really wonder
what's going on here!" <wink>.  I didn't know, and was throwing out guesses.
Now that I'm using the scheme myself, I still don't know, but lots of
"interesting questions" are forming.  (Overall, the scheme is showing
excellent performance on my live email so far with much less training data
than I was using before, but there are still jaw-dropping exceptions;
studying those carefully takes time, and so far no pattern has made itself
obvious beyond the fact that hapaxes can do the darnedest things ...).

The theoretical motivation for tiling is to eliminate systematic creation of
highly correlated clues.  The bad thing that can result from such correlation
is "spectacular failures".  That doesn't appear relevant to the cases you
described there, where

    Some are nailed for me - 100% (rounded), but others are
    solidly unsure (50%ish).

A "spectacular failure" would have been if a Nigerian scam scored 0% for
you.
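
For concreteness, here's roughly what I mean by tiling -- a greedy sketch
with made-up names, not the code I'm actually running:

    def tile(words, strength):
        """Cover the word positions with non-overlapping unigram and
        bigram clues, strongest clues first.  strength(tok) is assumed
        to return something like abs(spamprob(tok) - 0.5)."""
        candidates = []    # (strength, start, length, token)
        for i, w in enumerate(words):
            candidates.append((strength(w), i, 1, w))
        for i in range(len(words) - 1):
            bigram = words[i] + ' ' + words[i + 1]
            candidates.append((strength(bigram), i, 2, bigram))
        candidates.sort(reverse=True)    # strongest first
        covered = [False] * len(words)
        chosen = []
        for _s, start, length, tok in candidates:
            span = range(start, start + length)
            if not any(covered[j] for j in span):
                for j in span:
                    covered[j] = True
                chosen.append(tok)
        return chosen    # at most len(words) clues

Because each word position ends up in at most one chosen clue, overlapping
bigrams can't flood the scorer with near-duplicates of the same evidence --
that's the correlation tiling is meant to kill.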

> I did get improvements with a higher max_discriminators:
>
> <http://mail.python.org/pipermail/spambayes-dev/2003-September/001018.html>
> Is that likely to be just a side-effect of the crudeness of my
> approximation?

Well, that's not *evidence*, just a report of what happened on one specific
message.  "Evidence" is along the lines of "and across 2,000 messages, this
is what happened to the FP and FN rates in each of 10 cross-validation
runs".

Going back to

http://mail.python.org/pipermail/spambayes-dev/2003-September/000998.html

where the actual spamprobs are shown, lines like

'george'                   0.0848469           9      8
'right now,'               0.0857477           7      6

tell me (perhaps with the benefit of hindsight gained from later
developments) that something else screwy was happening too, starting with
the fact that, although those tokens were seen in "almost the same" number
of messages in each training class, their spamprobs were nevertheless
strongly hammy.  That can only happen if the training data is unbalanced.
If there were an equal # of ham and spam, the 'george' line would have had
spamprob 0.47, so weak the token would have been ignored.  You would have
needed at least an 11:1 spam:ham training ratio to get a spamprob that low,
and that's assuming you didn't also have the now-defunct experimental
imbalance adjustment option obscuring things (if you did have it on at the
time, the training imbalance must have been worse than 12 to 1).

As you said later:

    although there are almost twice as many spam clues as ham, the ham
    ones are lower

and a strong training imbalance in favor of spam does "cheapen" spam
evidence.  As in:

'reply this'                0.945495            6   1164

That's a *low* spamprob for a token that's been seen in almost 200x as many
spam as ham, doncha think?  It would have been 0.995 if the training data
were balanced and you saw those counts.
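
If you want to check the arithmetic, here's the Robinson-style spamprob
computation with the S=0.45, x=0.5 defaults, assuming the columns quoted
above are token, spamprob, #ham, #spam; the training sizes below are made
up (only the spam:ham ratio matters for the comparison):

    S, X = 0.45, 0.5    # unknown-word strength and probability defaults

    def spamprob(hamcount, spamcount, nham, nspam):
        """Robinson-adjusted spam probability for one token, given its
        per-class counts and the total number of trained hams/spams."""
        hamratio = hamcount / float(nham)
        spamratio = spamcount / float(nspam)
        prob = spamratio / (hamratio + spamratio)
        n = hamcount + spamcount
        return (S * X + n * prob) / (S + n)

    # 'george', seen in 9 hams and 8 spams:
    print(spamprob(9, 8, 1000, 1000))      # ~0.47 with balanced training
    print(spamprob(9, 8, 1000, 11000))     # ~0.086 at an 11:1 spam:ham ratio

    # 'reply this', seen in 6 hams and 1164 spams:
    print(spamprob(6, 1164, 1000, 1000))   # ~0.995 with balanced training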

So I still don't know what was going on with that msg, but the existence of
so strong a training imbalance makes it a much less interesting pursuit to
me now than I thought it was then.

One more:  you eventually got a "spammy enough" score by boosting
max_discriminators to 600.  Without knowing what effects that had across
*many* messages, it's just an anecdote, but I do know that the chi2Q
implementation can become numerically unreliable with max_discriminators set
that high.  That's one reason it defaults to 150.  The other is that testing
with the unigram scheme found no reason to make it even that high.  I'm not
seeing any reason in my own experiment so far to suspect the mixed scheme
needs it higher, but my training data is scrupulously balanced, and I've
only been at it for one day.
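
For reference, the heart of chi2Q is a series of this shape (a sketch along
the lines of what's in SpamBayes' chi2.py):

    import math

    def chi2Q(x2, v):
        """Return prob(chisq >= x2), with v (even) degrees of freedom."""
        assert v & 1 == 0
        m = x2 / 2.0
        total = term = math.exp(-m)    # underflows to 0.0 for m > ~745
        for i in range(1, v // 2):
            term *= m / i
            total += term
        return min(total, 1.0)

The combiner feeds it v = 2 * (number of clues) and an x2 that's a sum of
-2*ln(p) terms, one per clue.  With 150 discriminators that sum rarely gets
anywhere near the underflow point; with 600 extreme clues it easily can, and
once exp(-m) underflows every term in the series is 0.0 and the
"probability" collapses to exactly 0 -- that's the kind of numeric
unreliability I mean.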



