[Spambayes] Total cost analysis

Tim Peters tim.one@comcast.net
Sun, 13 Oct 2002 16:42:32 -0400


[Rob Hooft]
> I checked in a new program 'cvcost.py' that analyses the total human
> cost to you of human spam filtering based on a result of timcv.py

Very cool!  Thank you.  Everyone, note that this looks at the 'all runs' ham
and spam histograms at the end of the timcv.py output, so the granularity of
the analysis is limited by your nbuckets setting.  I usually run with nbuckets
set to 200; maybe I should boost the default to that (it's currently 40).
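
To make the granularity point concrete, here is a minimal sketch (not code
from cvcost.py or timcv.py) that assumes scores run from 0.0 to 1.0 and are
binned into nbuckets equal-width buckets.  The helper name cutoff_candidates
is made up for illustration.  Under that assumption the only cutoffs a
histogram-based analysis can tell apart are the bucket boundaries:

```python
def cutoff_candidates(nbuckets):
    """Return the bucket boundaries for scores in [0.0, 1.0].

    Hypothetical helper: assumes equal-width buckets, so these are the
    only cutoff values a histogram with `nbuckets` bins can distinguish.
    """
    return [i / nbuckets for i in range(nbuckets + 1)]


coarse = cutoff_candidates(40)    # boundaries 0.025 apart
fine = cutoff_candidates(200)     # boundaries 0.005 apart
```

With 40 buckets, two messages whose scores differ by less than 0.025 can
land in the same bucket and so can't be separated by any cutoff; with 200
the step shrinks to 0.005.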

> The program is called cvcost.py. The default cost for an unknown message
> is set to $0.20, for a fn to $1 and for a fp to $10; these numbers can
> be changed using command line options.

I find I can make almost any scheme "the winner" by fiddling these to
extreme enough values <wink>.  In particular, by boosting the fp cost toward
infinity, the all-default scheme Rulz -- even at nbuckets 200, the extreme
schemes don't have fine enough granularity in the histograms to weed out the
one or two (depending on scheme) extremely high-scoring false positives in
my data.  But I don't actually care if the Nigerian scam quote gets
rejected, so like all automated analyses this has to be tempered with
judgment.  It's a wonderfully useful tool then!
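
For anyone who wants to play with the effect, here is a rough sketch of the
kind of calculation involved.  It is an assumption about the general idea,
not a copy of what cvcost.py does: the function names, the toy histograms,
and the brute-force search over bucket indices are all invented for
illustration.  Only the default costs ($0.20 unsure, $1 fn, $10 fp) come
from Rob's description.

```python
def total_cost(ham_hist, spam_hist, ham_cut, spam_cut,
               unsure_cost=0.20, fn_cost=1.0, fp_cost=10.0):
    """Cost of one pair of cutoffs, given as bucket indices.

    Buckets below ham_cut are called ham, buckets at or above spam_cut
    are called spam, and everything in between is 'unsure'.
    """
    fp = sum(ham_hist[spam_cut:])       # ham called spam
    fn = sum(spam_hist[:ham_cut])       # spam called ham
    unsure = (sum(ham_hist[ham_cut:spam_cut]) +
              sum(spam_hist[ham_cut:spam_cut]))
    return fp * fp_cost + fn * fn_cost + unsure * unsure_cost


def best_cutoffs(ham_hist, spam_hist, **costs):
    """Try every (ham_cut, spam_cut) pair; return (cost, ham_cut, spam_cut)."""
    n = len(ham_hist)
    best = None
    for ham_cut in range(n + 1):
        for spam_cut in range(ham_cut, n + 1):
            cost = total_cost(ham_hist, spam_hist, ham_cut, spam_cut, **costs)
            if best is None or cost < best[0]:
                best = (cost, ham_cut, spam_cut)
    return best


# Toy 5-bucket histograms (made-up counts), with one ham in the top bucket.
ham_hist = [90, 8, 1, 0, 1]
spam_hist = [1, 1, 3, 15, 80]

default_best = best_cutoffs(ham_hist, spam_hist)
extreme_best = best_cutoffs(ham_hist, spam_hist, fp_cost=1e9)
```

In this toy data the one ham sitting in the highest bucket can't be split
from the spam around it, so once fp_cost is pushed high enough the cheapest
answer is to stop calling anything spam at all.  Whether that ham is one you
actually care about is exactly the judgment call the numbers can't make.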


PS:  I'm rerunning my fat test now with your alternative S-and-H combination
scheme.  I agree that I like the effects it had in the examples you
presented; we'll see whether my data agrees too ...