[Spambayes] z

Gary Robinson grobinson@transpose.com
Tue Oct 15 14:38:21 2002


> It's indeed not working better for anyone so far, and it does suffer
> cancellation disease.  OTOH, it was a quick hack to get a quick feel for how
> this *kind* of approach might work, and it didn't go all the way.  Gary
> would like to "rank" the spamprobs first, but that requires another version
> of "the third training pass" that I just don't know how to make practical
> over time.

Actually, I think it would be complicated, or even impossible, to do it the
way it really *should* be done. The ranking would have to be structured so
that spammy words always had a rank over .5 and hammy words a rank under .5,
while keeping the probability of hitting a spam or a ham the same under a
reasonable null hypothesis.

It would get complicated, so I recommend not bothering to try to do it
right. I know I don't have time to try to work out a good way to do it now.


> If Rob is feeling particularly adventurous, it would be interesting (in
> connection with z-combining) to transform the database spamprobs into
> unit-normalized zscores via his RMS black magic, as an extra step at the end
> of update_probabilities().  This wouldn't require another pass over the
> training data, would speed z-combining scoring a lot, and I *think* would
> make the inputs to this scheme much closer to what Gary would really like
> them to be (z-combining *pretends* the "extreme-word" spamprobs are normally
> distributed now; I don't have any idea how close that is to the truth).

I didn't realize that this wasn't already being done. Yes, I would recommend
that somebody do this, because I don't think we're testing the z approach
completely fairly until it is done.
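
For concreteness, here is a minimal sketch of what such an extra step might
look like. It is not the Spambayes code: the function names and the example
database are hypothetical, plain mean/standard-deviation standardization
stands in for Rob's RMS transform (whose details aren't given here), and the
combining step assumes a sum of z-scores divided by sqrt(n), mapped back
through the normal CDF.

```python
import math
from statistics import NormalDist, fmean, pstdev


def normalize_spamprobs(spamprobs):
    """Map each word's spamprob to a z-score (mean 0, std 1) over the database.

    Hypothetical stand-in for the normalization step discussed above; it
    needs only the stored spamprobs, not another pass over training data.
    """
    values = list(spamprobs.values())
    mean = fmean(values)
    sd = pstdev(values, mu=mean)
    if sd == 0.0:
        # Degenerate database: every word carries the same evidence.
        return {word: 0.0 for word in spamprobs}
    return {word: (p - mean) / sd for word, p in spamprobs.items()}


def z_combine(zscores, words):
    """Combine per-word z-scores into one score in (0, 1).

    Assumes the z-scores are roughly independent unit normals, so that
    sum(z) / sqrt(n) is again unit normal. Unknown words are ignored.
    """
    zs = [zscores[w] for w in words if w in zscores]
    if not zs:
        return 0.5  # no evidence either way
    z = sum(zs) / math.sqrt(len(zs))
    return NormalDist().cdf(z)


# Hypothetical four-word database, for illustration only.
db = {"viagra": 0.99, "free": 0.90, "meeting": 0.10, "python": 0.01}
zdb = normalize_spamprobs(db)
print(z_combine(zdb, ["viagra", "free"]))     # above 0.5
print(z_combine(zdb, ["meeting", "python"]))  # below 0.5
```

Note that one strongly spammy word and one equally strong hammy word sum to
about zero here, so the score lands near 0.5. That is the cancellation
behavior mentioned in the quoted text, and normalizing the inputs does not
remove it.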

I'm not saying I believe that the z approach will turn out to be better -- I
just don't know -- but it seems worth trying.

Gary




-- 
Gary Robinson
CEO
Transpose, LLC
grobinson@transpose.com
207-942-3463
http://www.emergentmusic.com
http://radio.weblogs.com/0101454