[Spambayes] 5% points in statistics

Tim Peters tim.one@comcast.net
Fri Oct 18 07:59:41 2002


Inspired by Rob's patch, there's a new option:

"""
[TestDriver]
# Histogram analysis also displays percentiles.  For each percentile p
# in the list, the score S such that p% of all scores are <= S is given.
# Note that percentile 50 is the median, and is displayed (along with the
# min score and max score) independent of this option.
percentiles: 5 25 75 95
"""
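
For anyone who wants to check the definition in that comment by hand, here
is a minimal sketch of one way to compute such a percentile. It is not the
TestDriver code. The function name `percentile` and the nearest-rank rule
are assumptions chosen to match the wording "the score S such that p% of
all scores are <= S".

```python
import math

def percentile(scores, p):
    """Return the smallest score S such that at least p% of scores are <= S.

    Hypothetical helper using the nearest-rank rule; the real histogram
    code may pick or interpolate the value differently.
    """
    if not scores:
        raise ValueError("need at least one score")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(scores)
    # 1-based rank of the first element covering p% of the data.
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]

# Made-up scores, only to exercise the function.
demo_scores = list(range(1, 11))
for p in (5, 25, 50, 75, 95):
    print("%d%% -> %s" % (p, percentile(demo_scores, p)))
```

With an even number of scores this rule returns the lower of the two middle
values for p = 50, so it can differ slightly from a median that averages them.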

Example output from the starts of histogram displays:

-> <stat> Ham scores for all runs: 100 items; mean 6.23; sdev 16.47
-> <stat> min 2.51688e-008; median 0.19102; max 85.9665
-> <stat> percentiles: 5% 0.000538997; 25% 0.0281789; 75% 2.81561; 95% 45.2147

-> <stat> Spam scores for all runs: 100 items; mean 99.97; sdev 0.26
-> <stat> min 97.3715; median 100; max 100
-> <stat> percentiles: 5% 99.9512; 25% 100; 75% 100; 95% 100

From that alone you can deduce that this tiny 10-fold cv run using
chi-combining nailed all the spam (min spam score was over 95), nailed at
least 75% of the ham (75% of all ham scores were under 2.82 < 5), and that
no ham scored in the spam zone (max ham score was < 86).
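
The same deductions can be sketched mechanically. The numbers below are
copied from the example output above; the cutoffs of 5 and 95, the dict
layout, and the function name `deduce` are assumptions for illustration,
not anything the test driver provides.

```python
# Assumed cutoffs: below HAM_CUTOFF counts as ham, above SPAM_CUTOFF as spam.
HAM_CUTOFF = 5.0
SPAM_CUTOFF = 95.0

# Summary lines from the example output; the median is percentile 50.
ham = {"min": 2.51688e-08, "max": 85.9665,
       "pct": {5: 0.000538997, 25: 0.0281789, 50: 0.19102,
               75: 2.81561, 95: 45.2147}}
spam = {"min": 97.3715, "max": 100.0,
        "pct": {5: 99.9512, 25: 100.0, 50: 100.0, 75: 100.0, 95: 100.0}}

def deduce(ham, spam, ham_cutoff=HAM_CUTOFF, spam_cutoff=SPAM_CUTOFF):
    """Return what the summary lines alone guarantee about a run."""
    # Every spam scored above the spam cutoff if even the minimum did.
    all_spam_nailed = spam["min"] > spam_cutoff
    # If percentile p is under the ham cutoff, at least p% of ham is too.
    ham_ok = [p for p, score in ham["pct"].items() if score < ham_cutoff]
    ham_nailed_at_least = max(ham_ok) if ham_ok else 0
    # No ham landed in the spam zone if even the maximum stayed below it.
    no_ham_in_spam_zone = ham["max"] < spam_cutoff
    return all_spam_nailed, ham_nailed_at_least, no_ham_in_spam_zone

print(deduce(ham, spam))
```

The ham figure is only a lower bound: the true fraction under the cutoff lies
somewhere between the 75% and 95% marks, and the listed percentiles cannot
narrow it further.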

BTW, it's a curious thing that *all* schemes have been better at nailing
spam than ham with very little training data, going all the way down to
training on just one of each.  I still don't know where the crossover point
is in my data (by the time I run my fat test, the roles are reversed: it's
better at nailing ham than spam).