Bland words, and z-combining (was RE: [Spambayes] Bland word only score..)

Tim Peters tim.one@comcast.net
Mon, 14 Oct 2002 01:42:58 -0400


[Rob Hooft]
> [Tim: the previous copy of this message I sent to you was too quick.]

Ah, replied to that privately.  Bottom line:

[tail end of histograms from a run looking *only* at bland words]
> -> best cutoff for all runs: 0.58
> ->     with weighted total 10*0 fp + 5797 fn = 5797
> ->     fp rate 0%  fn rate 99.9%

The overlap is so bad that even with 200 buckets, the best the histogram
analysis could do was to suggest a cutoff with a nearly 100% FN rate.

> -> <stat> Ham scores for all runs: 16000 items; mean 49.59; sdev 1.41
> -> <stat> min 40.7953; median 49.9561; max 57.7839

> -> <stat> Spam scores for all runs: 5800 items; mean 50.39; sdev 1.25
> -> <stat> min 43.2803; median 50.2241; max 59.1799

So whether ham or spam, nearly half the bland words point in the wrong
direction.  It's too much like adding in coin flips for my tastes.

> I had cv5.txt : New decision criterion: prob = (S-H+1)/2
>                  robinson_minimum_prob_strength = 0.0
>
> Adding cv6.txt : Same as cv5 but with
>                   robinson_minimum_prob_strength = 0.1
>
>
> amigo[165]spambayes% /usr/local/bin/python cvcost.py cv[56].txt
> cv5.txt: Optimal cost is $103.2 with grey zone between 49.0 and 96.0
> cv6.txt: Optimal cost is $109.0 with grey zone between 49.0 and 97.0
>
> So for me, robinson_minimum_prob_strength = 0.0 gives the best result
> yet.

It didn't help on my data:

chisq.txt: Optimal cost is $27.0 with grey zone between 50.0 and 78.0
bland.txt: Optimal cost is $28.2 with grey zone between 50.0 and 85.0

The difference is so small I can't swear it hurt, either.  I suspect the
difference in your case is likewise too small to support a confident
conclusion.


There's *one* scheme where including the bland words helps me:  an option,
use_z_combining, that I haven't talked about here and which implements
another speculative idea from Gary.  That one is, well, extremely extreme.
Only 16 of 20,000 ham scored over 0.50 using it, and only 3 of 14,000 spam
scored under 0.50.  The 16 FP include my 2 that will never go away, and they
score 1.00000000000 and 0.999693086732 even with the bland words.  BTW, in
*some* sense the z-combining score is an actual probability.
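
To make that concrete, here's a minimal sketch of the z-combining idea --
my guess at the shape of the computation, not a copy of the
use_z_combining code (the names here are made up):

    from math import sqrt
    from statistics import NormalDist

    _norm = NormalDist()  # standard normal

    def z_combine(probs):
        # Sketch only.  Under the null hypothesis that each word prob
        # is uniform on [0, 1], inv_cdf(prob) is a standard normal
        # deviate; the sum of n of those, divided by sqrt(n), is again
        # standard normal, and cdf() maps it back to [0, 1].  probs
        # must be non-empty and lie strictly inside (0, 1) -- word
        # probs are clamped away from the endpoints anyway.
        zs = [_norm.inv_cdf(p) for p in probs]
        return _norm.cdf(sum(zs) / sqrt(len(zs)))

That's the sense in which the score is a probability.  It also shows why
the scheme is so extreme:  when the clues agree, the sum grows like n but
is divided only by sqrt(n), so long messages saturate the CDF; and
opposing z-scores cancel outright, which is the "cancellation disease"
I'll get to below.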

With the all-default costs, cvcost sez z-combining worked even better for me
(including all bland words):

zcomb.txt: Optimal cost is $26.8 with grey zone between 75.0 and 90.0

The difference between that and chisq.txt's $27.0 is one "not sure" msg out
of 34,000, so I'm not highly motivated to pursue it.  But I encourage others
to try it -- it may work better on harder data than mine!  I'll note that it
suffers its own form of "cancellation disease" (one of my very long spam
scored 0.0000000000041), which the chi-squared scheme is refreshingly free
of (that same spam scored 0.5 under chi combining).

If you want to try it, I suggest

"""
[Classifier]
use_z_combining: True
robinson_minimum_prob_strength: 0.0

[TestDriver]
nbuckets: 200
"""

I'd rather that people who haven't been playing along lately try
chi-combining, though, because as far as I'm concerned, the results so far
say it's the best scheme we've got -- and as someone else recently
suggested, it's high time to start killing off the losers again.

"""
[Classifier]
use_chi_squared_combining: True

[TestDriver]
nbuckets: 200
"""

I sped that up, BTW (it invokes log() up to 150x less often now).

Note that chi and z combining do NOT require "the third" training pass, so
cross-validation tests can be run in the default "high speed" mode
(incremental training and untraining work fine with these).
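
E.g., assuming the usual timcv.py driver and Data/ layout, a 5-fold run
is just something like

    % python timcv.py -n 5 > chisq.txt

and likewise (after flipping the options) for zcomb.txt.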