[Spambayes] z-combining

Tim Peters tim.one@comcast.net
Mon, 14 Oct 2002 22:52:47 -0400


[T. Alexander Popiel]
> Well, I did a z-combining run.  @whee.  It replaces my
> all-defaults run as cv1.  chi-square remains as cv2.
>
> From results.txt:

[inconsistent effects on means across runs, small and large effects on
 sdevs, but overall decreases]

> ...
> z-combining loses vs. chi-square there, with looser sdevs.

The sdevs actually got smaller overall:

> ham mean and sdev for all runs
>    0.44    0.44   +0.00%        5.90    5.65   -4.24%
>
> spam mean and sdev for all runs
>   98.50   98.47   -0.03%       10.81    9.72  -10.08%

The means are so far apart compared to the sdevs, and the extreme
concentration at the endpoints, though, that random overlap isn't an issue
with either scheme -- the mistakes these guys make are more fundamental than
random.

> Next, we have the best computations for z-combining:
>
> """
> -> best cost $54.20
> -> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
> -> achieved at 6 cutoff pairs
> -> smallest ham & spam cutoffs 0.01 & 0.985
> ->     fp 3; fn 13; unsure ham 12; unsure spam 44
> ->     fp rate 0.15%; fn rate 0.65%; unsure rate 1.4%
> -> largest ham & spam cutoffs 0.035 & 0.985
> ->     fp 3; fn 13; unsure ham 12; unsure spam 44
> ->     fp rate 0.15%; fn rate 0.65%; unsure rate 1.4%
> """
>
> Compare with the one from chi-square:
>
> """
> -> best cost $48.00
> -> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
> -> achieved at 3 cutoff pairs
> -> smallest ham & spam cutoffs 0.03 & 0.89
> ->     fp 3; fn 6; unsure ham 12; unsure spam 48
> ->     fp rate 0.15%; fn rate 0.3%; unsure rate 1.5%
> -> largest ham & spam cutoffs 0.03 & 0.9
> ->     fp 3; fn 6; unsure ham 12; unsure spam 48
> ->     fp rate 0.15%; fn rate 0.3%; unsure rate 1.5%
> """
>
> Looks like z-combining has real granularity problems near
> the top end.  Trash it.

It's indeed not working better for anyone so far, and it does suffer
cancellation disease.  OTOH, it was a quick hack to get a quick feel for how
this *kind* of approach might work, and it didn't go all the way.  Gary
would like to "rank" the spamprobs first, but that requires another version
of "the third training pass" that I just don't know how to make practical
over time.
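For concreteness, here's a minimal sketch of this *kind* of scheme, in the
Stouffer style (map each spamprob through the inverse normal CDF, sum the
z-scores, renormalize by sqrt(n)); the quick hack in the codebase differed
in details, so treat this as illustrative only:

```python
from math import sqrt
from statistics import NormalDist

_N = NormalDist()  # standard normal distribution

def z_combine(spamprobs):
    """Combine per-word spamprobs in (0, 1) into one score in (0, 1).

    Each probability maps to a z-score via the inverse normal CDF.
    A sum of n independent standard-normal z-scores has standard
    deviation sqrt(n), so dividing by sqrt(n) renormalizes before
    mapping back through the normal CDF.
    """
    zs = [_N.inv_cdf(p) for p in spamprobs]
    return _N.cdf(sum(zs) / sqrt(len(zs)))
```

Note the cancellation disease is visible here: a 0.01 word and a 0.99 word
contribute equal and opposite z-scores, wiping each other out no matter how
strong each clue is on its own.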

If Rob is feeling particularly adventurous, it would be interesting (in
connection with z-combining) to transform the database spamprobs into
unit-normalized zscores via his RMS black magic, as an extra step at the end
of update_probabilities().  This wouldn't require another pass over the
training data, would speed z-combining scoring a lot, and I *think* would
make the inputs to this scheme much closer to what Gary would really like
them to be (z-combining *pretends* the "extreme-word" spamprobs are normally
distributed now; I don't have any idea how close that is to the truth).  The
attraction of this scheme is that it gives a single "spam probability"
directly; combining distinct ham and spam indicators is still a bit of a
puzzle (although a happy puzzle from my POV when both indicators suck, as
happens in chi combining with large numbers of strong clues on both ends).
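For contrast, the chi combining referred to above builds the two distinct
indicators with Fisher's method and then folds them together.  A rough
sketch (chi2Q is the standard survival function for a chi-squared variable
with an even number of degrees of freedom; this is the idea, not the
production code):

```python
from math import exp, log

def chi2Q(x2, v):
    """P(chi-squared with v degrees of freedom >= x2), for even v."""
    m = x2 / 2.0
    term = total = exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def chi_combine(spamprobs):
    """Fold distinct ham and spam indicators into one score in [0, 1]."""
    n = len(spamprobs)
    # H is driven toward 0 by spammy words, S toward 0 by hammy words.
    H = chi2Q(-2.0 * sum(log(1.0 - p) for p in spamprobs), 2 * n)
    S = chi2Q(-2.0 * sum(log(p) for p in spamprobs), 2 * n)
    # When both indicators suck (strong clues on both ends), S and H
    # cancel and the score lands near 0.5 -- i.e., "unsure".
    return (S + (1.0 - H)) / 2.0
```

A message full of strong clues on both ends drives both H and S toward 0,
so the score sits near 0.5 instead of committing either way, which is
exactly the happy behavior mentioned above.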