Bland words, and z-combining (was RE: [Spambayes] Bland word only score..)
Tim Peters
tim.one@comcast.net
Mon, 14 Oct 2002 01:42:58 -0400
[Rob Hooft]
> [Tim: the previous copy of this message I sent to you was too quick.]
Ah, replied to that privately. Bottom line:
[tail end of histograms after running looking *only* at bland words]
> -> best cutoff for all runs: 0.58
> -> with weighted total 10*0 fp + 5797 fn = 5797
> -> fp rate 0% fn rate 99.9%
The overlap is so bad that even with 200 buckets, the best the histogram
analysis could do is suggest a cutoff with a nearly 100% FN rate.
> -> <stat> Ham scores for all runs: 16000 items; mean 49.59; sdev 1.41
> -> <stat> min 40.7953; median 49.9561; max 57.7839
> -> <stat> Spam scores for all runs: 5800 items; mean 50.39; sdev 1.25
> -> <stat> min 43.2803; median 50.2241; max 59.1799
So whether ham or spam, nearly half the bland words point in the wrong
direction. It's too much like adding in coin flips for my tastes.
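For anyone following along, "bland" means a word whose spamprob sits close to 0.5, and robinson_minimum_prob_strength is the cutoff that discards them. A minimal sketch of that filter (names here are illustrative, not the actual spambayes code):

```python
# Sketch of a minimum-probability-strength filter for "bland" words.
# `spamprobs` maps word -> estimated spam probability; the function and
# variable names are hypothetical, not copied from the spambayes classifier.
def strong_clues(spamprobs, min_strength=0.1):
    """Keep only words whose spamprob is at least min_strength from 0.5."""
    return {w: p for w, p in spamprobs.items()
            if abs(p - 0.5) >= min_strength}

probs = {"viagra": 0.99, "the": 0.52, "meeting": 0.08, "hello": 0.45}
print(strong_clues(probs))  # only "viagra" and "meeting" survive
```

Setting min_strength to 0.0 (as in cv5.txt above) keeps every word, coin flips and all.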
> I had cv5.txt : New decision criterion: prob = (S-H+1)/2
> robinson_minimum_prob_strength = 0.0
>
> Adding cv6.txt : Same as cv5 but with
> robinson_minimum_prob_strength = 0.1
>
>
> amigo[165]spambayes% /usr/local/bin/python cvcost.py cv[56].txt
> cv5.txt: Optimal cost is $103.2 with grey zone between 49.0 and 96.0
> cv6.txt: Optimal cost is $109.0 with grey zone between 49.0 and 97.0
>
> So for me, robinson_minimum_prob_strength = 0.0 gives the best result
> yet.
It didn't help on my data:
chisq.txt: Optimal cost is $27.0 with grey zone between 50.0 and 78.0
bland.txt: Optimal cost is $28.2 with grey zone between 50.0 and 85.0
The difference is so small I can't swear it hurt, either. I think the
difference in your case is also too small to be confident about.
There's *one* scheme where including the bland words helps me: an option
use_z_combining I haven't talked about here, which implements another
speculative idea from Gary. That one is, well, extremely extreme.
Only 16 of 20,000 ham scored over 0.50 using it, and only 3 of 14,000 spam
scored under 0.50. The 16 FP include my 2 that will never go away, and they
score 1.00000000000 and 0.999693086732 even with the bland words. BTW, in
*some* sense the z-combining score is an actual probability.
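Gary's z-combining isn't spelled out in this message, but a classic way to combine per-word evidence into something that is "in some sense an actual probability" is Stouffer's Z-method for combining p-values: map each probability through the inverse normal CDF, sum, renormalize, and map back. The sketch below illustrates that general idea only; it is an assumption about the flavor of the scheme, not the spambayes implementation (phi, phi_inv, and z_combine are all made-up names):

```python
from math import sqrt, erf

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def phi_inv(p, lo=-10.0, hi=10.0):
    """Inverse normal CDF by bisection -- good enough for a sketch."""
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if phi(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def z_combine(probs):
    """Stouffer's method: combine word spamprobs into one score in (0, 1)."""
    z = sum(phi_inv(p) for p in probs) / sqrt(len(probs))
    return phi(z)

print(z_combine([0.9, 0.8, 0.95]))  # strongly spammy evidence -> near 1.0
```

A score built this way really is the tail probability of a normal test statistic, which fits the "actual probability" remark; it also shows why extreme inputs drive the combined score to extreme outputs so aggressively.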
With the all-default costs, cvcost sez z-combining worked even better for me
(including all bland words):
zcomb.txt: Optimal cost is $26.8 with grey zone between 75.0 and 90.0
The difference between that and chisq.txt's $27.00 is one "not sure" msg out
of 34,000, so I'm not highly motivated to pursue it. But I encourage others
to try it -- it may work better on harder data than mine! I'll note that it
suffers its own form of "cancellation disease" (one of my very long spam
scored 0.0000000000041), which the chi-squared scheme is refreshingly free
of (that same spam scored 0.5 under chi combining).
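For reference, chi-squared combining runs Fisher's method in both directions and folds the two tails together with the (S-H+1)/2 rule quoted earlier in this thread. A self-contained sketch of that scheme (the structure follows the formulas discussed on this list; function names and the exact code are illustrative):

```python
from math import log, exp

def chi2Q(x2, df):
    """Prob. that a chi-squared variate with df (even) degrees of
    freedom exceeds x2, via the closed-form series for even df."""
    m = x2 / 2.0
    term = exp(-m)
    total = term
    for i in range(1, df // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def chi_combine(probs):
    """Fisher-combine word probs in both directions; 0.5 means 'no idea'."""
    n = len(probs)
    H = chi2Q(-2.0 * sum(log(1.0 - p) for p in probs), 2 * n)  # ham evidence
    S = chi2Q(-2.0 * sum(log(p) for p in probs), 2 * n)        # spam evidence
    return (S - H + 1.0) / 2.0

print(chi_combine([0.5, 0.5, 0.5]))  # conflicting/neutral evidence -> 0.5
```

Note how a message full of 0.5-ish clues lands at 0.5 rather than at an extreme, which is exactly the "cancellation disease" resistance mentioned above: the long spam that z-combining scored 0.0000000000041 comes out as an honest "no idea" here.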
If you want to try it, I suggest
"""
[Classifier]
use_z_combining: True
robinson_minimum_prob_strength: 0.0
[TestDriver]
nbuckets: 200
"""
I'd rather that people who haven't been playing along lately try
chi-combining, though, because as far as I'm concerned, the results so far
say it's the best scheme we've got -- and as someone else recently
suggested, it's high time to start killing off the losers again.
"""
[Classifier]
use_chi_squared_combining: True
[TestDriver]
nbuckets: 200
"""
I sped that up, BTW (it invokes log() up to 150x less often now).
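The speedup presumably comes from batching: instead of calling log() once per word, multiply the probabilities and only rescale (tracking the exponent with frexp) when the running product nears underflow, taking a single log at the end. A sketch of that idea, as an assumption about the optimization rather than a quote of the actual patch:

```python
from math import frexp, log

LN2 = log(2.0)

def sum_logs(probs):
    """Sum of ln(p) over probs, calling log() once instead of once per p."""
    prod, exp_total = 1.0, 0
    for p in probs:
        prod *= p
        if prod < 1e-200:            # about to underflow; rescale
            prod, e = frexp(prod)    # prod in [0.5, 1.0), e is the exponent
            exp_total += e
    return log(prod) + exp_total * LN2
```

With typical spamprobs the rescale branch fires rarely, so the number of log() calls drops from one per clue to a handful per message.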
Note that chi and z combining do NOT require "the third" training pass, so
cross-validation tests can be run in the default "high speed" mode
(incremental training and untraining work fine with these).