[Spambayes] ok, i'm confused

Fri Mar 7 17:09:15 EST 2003

[Skip Montanaro]
> Here are the original X-Spambayes headers for the full-o'-spaces message:
>
>   X-Spambayes-Debug: '*H*': 0.56; '*S*': 0.47; 'subject:none': 0.05;
> ...
>   X-Spambayes-Classification: unsure; 0.46
>
> After my latest tweak to the tokenizer (ratio of spaces to total number of
> characters, after deleting leading and trailing whitespace on
> each line) and complete retraining (11k+ ham 7k+ spam), I get:
>
>   X-Spambayes-Debug: '*H*': 0.56; '*S*': 0.47; 'subject:none': 0.05;
> ...
>   X-Spambayes-Classification: spam; 0.95
>
> I've done nothing to adjust the values displayed in the X-Spambayes-Debug
> header, so all generated tokens should be displayed, and as you
> can see, all displayed tokens are the same, before and after.

I removed that part, in order to make an internal inconsistency clearer:
the overall score is

            prob = (S-H + 1.0) / 2.0

and 0.95 simply doesn't make any sense with H ~= 0.56 and S ~= 0.47.
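
To make the arithmetic concrete, here is a minimal Python sketch of the
formula above. The function name combined_score is made up for this
illustration and is not taken from the Spambayes source; the inputs are
the rounded *H* and *S* values shown in the debug header.

```python
def combined_score(S, H):
    """Combine spam evidence S and ham evidence H into one score in [0, 1]."""
    return (S - H + 1.0) / 2.0

# Values as displayed in the X-Spambayes-Debug header (already rounded).
score = combined_score(S=0.47, H=0.56)
print(round(score, 3))  # about 0.455

# What S - H would have to be for the score to reach 0.95.
needed_gap = 2.0 * 0.95 - 1.0
print(round(needed_gap, 2))  # about 0.9
```

With H ~= 0.56 and S ~= 0.47 the result is roughly 0.455, in line with the
original "unsure; 0.46" given that the displayed values are rounded. A
score of 0.95 would need S to exceed H by about 0.9.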

> ...
> Why is the message now classified as spam when before it was solidly in
> the middle of unsure?

A sharper question is how (0.47-0.56 + 1.0) / 2.0 came out to be 0.95.
Answer that, and you'll know everything <wink>.