[Spambayes] Seeking a giant idle machine w/ a miserable corpus

Sun Nov 17 00:44:16 2002

[G. Armour Van Horn]
> I've been meaning to ask, what do "real cost" and "best cost"
> actually mean?

In Options.py:

# After the display of a ham+spam histogram pair, you can get a listing
# of all the cutoff values (coinciding with histogram bucket boundaries)
# that minimize
#
#      best_cutoff_fp_weight * (# false positives) +
#      best_cutoff_fn_weight * (# false negatives) +
#      best_cutoff_unsure_weight * (# unsure msgs)
#
# This displays two cutoffs:  hamc and spamc, where
#
#     0.0 <= hamc <= spamc <= 1.0
#
# The idea is that if something scores < hamc, it's called ham; if
# something scores >= spamc, it's called spam; and everything else is
# called 'I'm not sure' -- the middle ground.
#
# Note:  You may wish to increase nbuckets, to give this scheme more
# cutoff values to analyze.
compute_best_cutoffs_from_histograms: True
best_cutoff_fp_weight:     10.00
best_cutoff_fn_weight:      1.00
best_cutoff_unsure_weight:  0.20

So by default, an FP is charged $10, an FN $1, and an unsure $0.20.  The
best cost is the lowest cost you could possibly have gotten by choosing ham
and spam cutoffs with perfect knowledge of how things would turn out.  The
real cost is how things actually turned out, using the ham and spam cutoffs
you supplied in advance.
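
To make the distinction concrete, here is a minimal sketch of the two numbers. It is not the code the test drivers run: it brute-forces the cutoff pairs over raw scores rather than working from the histograms, and the function names, the sample scores, and the 0.20/0.90 cutoffs are all made up for illustration. Only the three weights and the "< hamc is ham, >= spamc is spam" rule come from the Options.py comments above.

```python
from itertools import combinations_with_replacement

# Default weights quoted from Options.py above.
FP_WEIGHT = 10.00      # best_cutoff_fp_weight
FN_WEIGHT = 1.00       # best_cutoff_fn_weight
UNSURE_WEIGHT = 0.20   # best_cutoff_unsure_weight


def cost(ham_scores, spam_scores, hamc, spamc):
    """Total charge for one (hamc, spamc) pair (hypothetical helper)."""
    fp = sum(1 for s in ham_scores if s >= spamc)    # ham called spam
    fn = sum(1 for s in spam_scores if s < hamc)     # spam called ham
    unsure = sum(1 for s in ham_scores + spam_scores if hamc <= s < spamc)
    return FP_WEIGHT * fp + FN_WEIGHT * fn + UNSURE_WEIGHT * unsure


def best_cost(ham_scores, spam_scores, nbuckets=200):
    """Lowest cost over all bucket-boundary pairs with hamc <= spamc.

    Returns (cost, hamc, spamc).  Exhaustive search, chosen for clarity.
    """
    boundaries = [i / nbuckets for i in range(nbuckets + 1)]
    return min(
        (cost(ham_scores, spam_scores, h, s), h, s)
        for h, s in combinations_with_replacement(boundaries, 2)
    )


# Invented scores: one ham scores high, one spam scores low.
ham = [0.01, 0.02, 0.05, 0.30, 0.95]
spam = [0.99, 0.98, 0.90, 0.60, 0.10]

# "Real cost": what the cutoffs you picked in advance actually cost you.
real = cost(ham, spam, hamc=0.20, spamc=0.90)

# "Best cost": the cheapest outcome hindsight could have bought.
best, best_hamc, best_spamc = best_cost(ham, spam)
```

On this toy data the advance cutoffs pay for one FP, one FN, and two unsures, while the hindsight cutoffs push every troublesome message into the unsure range, which is far cheaper at the default weights. The best cost can never exceed the real cost so long as your own cutoffs fall on bucket boundaries, since they are then among the pairs searched.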

> I've seen you guys "spend" several million dollars while testing,
> and if it "costs" that much to test for spam in this way, I'm going
> to have a heck of a time marking it up and selling it to customers!

Relax; they're Canadian dollars <wink>.