# [Spambayes] Tough to classify

David Shaw david at theresistance.net
Sun Apr 13 23:49:21 EDT 2003

```
> I have no doubt that it was obviously ham to you, but don't accept it would
> have been obvious ham to humans other than you.  For example,
>

You're right of course.  The mail included this spammie bit:

certificates are available in any dollar amount from $5 to $5,000.
We'll deliver it via e-mail or physical mail-- so it's the perfect

> You must have many ham clues, else your *H* score wouldn't have been
> 0.98.

It had lots of strong clues for both.

> There are many ways to combine the individual word spamprobs so that the
> msg will come out as ham.  The trick is to do so in a way that doesn't
> also classify more spam as ham.  The combination method in spambayes is
> the end result of some intense work on the topic by several people, and
> beat a dozen other combination methods in large tests.  That doesn't mean
> it's the best possible combination method, but does suggest it won't be
> trivial to do better.

Oh I know.  I read the math in the chi squared code on Gary's page and
quickly got in over my head.  I took some probability math classes in
college, but it's been a few years.

Maybe I just need to adjust my thresholds.  This message scored:

X-Spambayes-Spam-Probability: 0.288224866953

I have my ham threshold at .2 and my spam at .8.  Almost always when a
message is unsure it is really spam.  This time it was ham.  I think
maybe I just need to set the thresholds to .3 and .7 and see how that
goes for a while.

> The combination code (in classifier.py) is about the easiest part of the
> system to change, so feel encouraged to test alternatives.  "I feel like"
> isn't really testable on its own <wink>.

I love Python for this very reason :)  If only I could figure out that
Dibbler stuff -- it seems very complicated (and slow, at least on OS X)
for what it's doing.  I'd love to replace it with something simpler and
faster.

```
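
The threshold change discussed in the message can be sketched in Python. This is only an illustration, not the spambayes code: the function name `classify` and the boundary handling (strictly below the ham cutoff is ham, at or above the spam cutoff is spam) are assumptions made for the example.

```python
def classify(score, ham_cutoff=0.2, spam_cutoff=0.8):
    """Map a spam probability in [0, 1] to 'ham', 'unsure', or 'spam'.

    Hypothetical helper: the boundary rules here are an assumption,
    not taken from the spambayes source.
    """
    if not 0.0 <= ham_cutoff <= spam_cutoff <= 1.0:
        raise ValueError("need 0 <= ham_cutoff <= spam_cutoff <= 1")
    if score < ham_cutoff:
        return "ham"
    if score >= spam_cutoff:
        return "spam"
    return "unsure"


# Score reported in the X-Spambayes-Spam-Probability header of the message.
score = 0.288224866953

current = classify(score, ham_cutoff=0.2, spam_cutoff=0.8)
proposed = classify(score, ham_cutoff=0.3, spam_cutoff=0.7)
print(current, proposed)
```

With the .2/.8 cutoffs this score lands in the unsure band; with .3/.7 it is classified as ham. Narrowing the band cuts both ways: any spam scoring between .2 and .3 would also become ham, which is why running with the new values "for a while" before settling on them seems sensible.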