[Spambayes] cancellation disease again?

Tim Peters tim.one@comcast.net
Mon Oct 21 16:37:57 2002


[Anthony Baxter]
> I think I'm seeing what's been referred to as cancellation disease again,
> using chi combining. I'm getting very very long spams (like those
> interminable MLMs with the "5 reports") that are getting both *H* and
> *S* scores at or near 1, and a final score of 0.5.
>
> E.g. the perfectly standard "send money for 5 reports" spam gets:
>
> prob = 0.500000000004

In "cancellation disease", "cancellation" does indeed refer to msgs with
huge numbers of both low-spamprob and high-spamprob words, and that's a
property of the msg in conjunction with the state of your training data --
cancellation can't be stopped.  "Diseased" refers to a scheme that infers
certainty when given such a msg.  For example, Graham-combining is diseased
in this way -- it would have scored this msg 0.0 or 1.0, and it's hard to
predict which.  chi-combining reliably scores such msgs smack in its middle
ground, which is the best that can be done -- chi is confused, and it knows
it's confused, and it tells you it's confused.
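
To make the H/S arithmetic concrete, here is a minimal sketch of
chi-combining in Python. It is an illustration under stated assumptions,
not a copy of the project's classifier code: the function names, the toy
word spamprobs (0.99 and 0.01), and the word counts are all made up.
naive_product_combine is a simplified stand-in for Graham-style combining
(it multiplies every word's odds and does not pick out the most extreme
words the way Graham's scheme does), included only to show how a product
rule turns a near-tie into near-certainty.

```python
import math

def chi2Q(x2, v):
    """Return P(chisq >= x2) for v degrees of freedom; v must be even."""
    assert v % 2 == 0
    m = x2 / 2.0
    # Note: math.exp(-m) underflows to 0.0 once m exceeds roughly 745,
    # which is acceptable for the toy inputs used here.
    total = term = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    # Rounding can push the sum a hair above 1.0.
    return min(total, 1.0)

def chi_combine(probs):
    """Return (S, H, score) for word spamprobs strictly between 0 and 1."""
    n = len(probs)
    if n == 0:
        raise ValueError("need at least one word spamprob")
    ln_ham = sum(math.log(p) for p in probs)          # small when many hammy words
    ln_spam = sum(math.log(1.0 - p) for p in probs)   # small when many spammy words
    S = 1.0 - chi2Q(-2.0 * ln_spam, 2 * n)  # near 1: strong spam evidence
    H = 1.0 - chi2Q(-2.0 * ln_ham, 2 * n)   # near 1: strong ham evidence
    return S, H, (S - H + 1.0) / 2.0

def naive_product_combine(probs):
    """Simplified product-of-odds rule (assumption: NOT Graham's exact scheme)."""
    log_odds = sum(math.log(p) - math.log(1.0 - p) for p in probs)
    return 1.0 / (1.0 + math.exp(-log_odds))

# Hypothetical long msg: 51 strong spam clues and 50 strong ham clues.
tilted = [0.99] * 51 + [0.01] * 50
S, H, score = chi_combine(tilted)
print("chi:     S=%r H=%r score=%r" % (S, H, score))
print("product: score=%r" % naive_product_combine(tilted))
```

With both S and H pinned near 1, (S - H + 1) / 2 lands near 0.5, while the
product rule lets the single extra spam clue decide the outcome with high
apparent confidence.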

> ...
> I'm not sure what the best way to approach this is....

Middle-ground schemes *have* a middle ground -- that's their point <wink>.
You have to be aware of their middle ground.  I set up Sean/Mark's Outlook
GUI to move chi middle-ground msgs into an Unsure folder.  For python.org
use, chi middle-ground msgs will be kicked out for human review.  If you
lack a mechanism like that, I suppose the best you can do is pass them on
(if you hate FP more than FN), or call them spam (if you hate FN more than
FP), or decide that 0.000000000004 over 0.5 means spam is the best guess (if
you're determined to wish away reality <wink>).
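
The choices above can be sketched as a small routing function. This is a
hypothetical illustration: the cutoff values, the names classify and
route, and the have_review / hate_fp_more flags are assumptions made for
the example, not settings taken from any particular deployment.

```python
# Assumed cutoffs for illustration only; tune them against your own data.
HAM_CUTOFF = 0.20
SPAM_CUTOFF = 0.90

def classify(score, ham_cutoff=HAM_CUTOFF, spam_cutoff=SPAM_CUTOFF):
    """Map a combined score in [0, 1] to 'ham', 'unsure', or 'spam'."""
    if score < ham_cutoff:
        return "ham"
    if score >= spam_cutoff:
        return "spam"
    return "unsure"

def route(score, have_review=True, hate_fp_more=True):
    """Decide what to do with a msg, with a fallback when nobody can review.

    have_review:  an Unsure folder or human reviewer exists.
    hate_fp_more: without review, pass unsure msgs on (True) or
                  call them spam (False).
    """
    verdict = classify(score)
    if verdict != "unsure" or have_review:
        return verdict
    return "ham" if hate_fp_more else "spam"

print(route(0.500000000004))                                        # unsure
print(route(0.500000000004, have_review=False))                     # ham
print(route(0.500000000004, have_review=False, hate_fp_more=False)) # spam
```

The third option in the paragraph above (treating anything over 0.5 as
spam) amounts to setting both cutoffs to 0.5, which removes the middle
ground entirely.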

In any case, after a correct classification is known, you should add it to
your training data.  Over time, the word spamprobs will change accordingly.
The "5 reports" spams I have in my personal-email classifier score with an
internal H of 0 and an internal S of 1, for a final score of 1.