[Spambayes] z

Tim Peters tim.one@comcast.net
Wed Oct 16 05:06:21 2002


[Gary Robinson]
> ...
> Based on what you say in the message quoted below, I think you're
> already doing what I was hoping for, with the exception of the ranking
> part!

Me too <wink>.  If I didn't mention it before, that code snippet *does*
produce uniformly distributed outputs in [0, 1] when fed artificially
constructed vectors of uniformly-distributed random probs, so there's
nothing wrong with the theory or this implementation of it -- so far as it
goes.
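As a sketch of the kind of check being described: under Fisher's method, -2*sum(ln p) over n independent uniform probs is chi-squared with 2n degrees of freedom, so pushing that sum through the chi-squared survival function should land back on a uniform [0, 1] output. The names and the series-based chi2Q below are illustrative, not necessarily the exact snippet under discussion.

```python
import math
import random

def chi2Q(x2, v):
    """Survival function (upper tail) of the chi-squared
    distribution with v degrees of freedom, v even, computed
    via the standard Poisson series."""
    assert v % 2 == 0
    m = x2 / 2.0
    total = term = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def fisher_combine(probs):
    """Fisher's method: if the probs are independent uniforms,
    -2*sum(ln p) is chi-squared with 2n dof, so the survival
    function maps the statistic back onto [0, 1]."""
    n = len(probs)
    return chi2Q(-2.0 * sum(math.log(p) for p in probs), 2 * n)

# Feed artificially constructed vectors of uniform random probs;
# the combined scores should themselves come out roughly uniform.
random.seed(42)
scores = [fisher_combine([random.random() for _ in range(50)])
          for _ in range(2000)]
mean = sum(scores) / len(scores)
```

For uniform inputs the mean of the combined scores should sit near 0.5, with every score inside [0, 1], which is the "uniformly distributed outputs" property mentioned above.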

> I guess I was confused by the earlier message...
>
> And I also agree that it doesn't make sense to try ranking now
> because there are aspects to this data that mean it won't come out
> to a uniform distribution under a reasonable null hypothesis
> without more tweaking than I (or, I guess, any of us) can suggest
> a way to do at this point.

More, I wouldn't see much point to it even if it were dead easy:  the chi-
and z- schemes are having no problems at all making correct extreme
judgments about ham and spam 99+% of the time.  The cases where they're
prone to mistakes mostly fall in "a middle ground", and staring at many
examples strongly suggests they're just freaking hard to classify.  It's
hard to imagine in what sense ranking (or any other probability
preconditioning) could really help here -- the mistakes aren't failures to
separate the spaces when a clear separation exists.

However, I think it may well be worth pursuing with your *original* scheme,
because that one had trouble establishing a clear boundary between ham and
spam scores, and creating "a middle ground" for it via two cutoffs ended up
capturing many more correctly classified messages than the middle grounds in
the chi- and z- schemes (although the z-scheme is so extreme that
sometimes the best spam cutoff is over 0.995!  that's due in part to
wanting to avoid false positives, and in part to the cancellation
disease that sometimes gives ham very high z spam scores).
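The two-cutoff idea above can be sketched in a few lines; the cutoff values here are made up for illustration, not the tuned values from any of the schemes.

```python
def classify(score, ham_cutoff=0.20, spam_cutoff=0.90):
    """Two cutoffs carve out a 'middle ground': scores below
    ham_cutoff are called ham, scores above spam_cutoff are
    called spam, and everything between is left unsure rather
    than forced into a mistaken extreme judgment."""
    if score < ham_cutoff:
        return "ham"
    if score > spam_cutoff:
        return "spam"
    return "unsure"
```

The point in the text is about where the cutoffs end up: for the original scheme the middle ground captured many correctly classified messages, while for the z-scheme spam_cutoff sometimes had to be pushed above 0.995.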