[Spambayes] Tokenizing numbers and money

Tim Peters tim.one@comcast.net
Thu Oct 17 06:34:38 2002


[Tim]
> We do know that each time it's been tried, fiddling the value of
> robinson_probability_s has had a real effect on results, and that
> reducing it from 1 has always helped.  The effect of reducing it
> is to give more extreme spamprobs to rare words, so we already
> know that the treatment of rare words is important (or was important,
> in the schemes under which that experiment was tried).  I don't
> know how numbers specifically fit into that.

[Rob Hooft]
> The problem is that the final scoring has been adapted so thoroughly
> since those tests, that all of that should be done again.

Along with everything else <0.1 wink> -- everything is always open to
question here.  But I have to point out that the training and scoring code
has remained absolutely regular under all schemes since abandoning Graham's
original collection of deliberate biases:  no special cases, no warts, no
tweaks (the *tokenizer* code is a different story).

The words with extreme spamprobs have the strongest effects under all
schemes, and s controls how quickly or slowly a spamprob can *get* extreme
relative to the # of msgs a word has been seen in.  In that sense, there's
some reason to believe "the best" value for s is more a function of the data
than of the combining scheme.  Make s too small and too much credence is
given to accidents; make s too large and the amount of training data needed
to get crisp decisions zooms.
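
For concreteness, the smoothing in question looks like this -- a standalone
sketch for illustration, not the actual classifier code (among other things,
the raw estimate below skips the corpus-size normalization the real code
does); s and x correspond to the robinson_probability_s and
robinson_probability_x options:

    def smoothed_spamprob(spam_count, ham_count, s=0.45, x=0.5):
        """Sketch of Robinson's smoothing:  the raw counting estimate p,
        based on the n msgs a word has been seen in, is pulled toward the
        unknown-word prior x with strength s."""
        n = spam_count + ham_count
        p = spam_count / float(n) if n else x   # raw estimate from counting
        return (s * x + n * p) / (s + n)

    # A word seen exactly once, in a spam, under three values of s:
    for s in (0.25, 0.45, 0.75):
        print("s=%.2f -> %.3f" % (s, smoothed_spamprob(1, 0, s=s)))
    # s=0.25 -> 0.900, s=0.45 -> 0.845, s=0.75 -> 0.786

The smaller s is, the faster a single sighting pushes a word to an extreme
spamprob; the larger s is, the longer the word hugs the prior x.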

> And then it becomes very difficult, because the procedure is so
> good now that we're all looking with a microscope at all our
> fp/fn's and anyway, I "agree" (that it "looks" wrong) with the
> filter in most of my fp/fn cases.

There's something else to vary too:  nobody has looked at fiddling
max_discriminators under the newer schemes, and from what I see here I think
we all leave it at the default 150, which was chosen based on the
death-match results pitting Gary's original scheme against Paul's scheme.
It could be that max_discriminators should change.
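
Schematically, max_discriminators just caps how many of the most extreme
clues feed the combining step, with minimum_prob_strength below playing the
role of robinson_minimum_prob_strength.  A sketch of the clue selection (not
the real scoring code):

    def strongest_clues(spamprobs, max_discriminators=150,
                        minimum_prob_strength=0.1):
        """Sketch:  throw away words whose spamprob is too close to 0.5
        to matter, then keep only the max_discriminators words farthest
        from 0.5 for the combining step."""
        strong = [p for p in spamprobs
                  if abs(p - 0.5) >= minimum_prob_strength]
        strong.sort(key=lambda p: abs(p - 0.5), reverse=True)
        return strong[:max_discriminators]

Raising max_discriminators lets weaker clues into the mix; lowering it leans
harder on the few most extreme spamprobs, so the best value may well
interact with s and with the combining scheme.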

> I did try something:
>
> s=0.25:
> -> <stat> Ham scores for all runs: 16000 items; mean 0.51; sdev 4.70
> -> <stat> min -1.33227e-13; median 1.19543e-11; max 100
> --
> -> <stat> Spam scores for all runs: 5800 items; mean 99.10; sdev 5.81
> -> <stat> min 2.89463e-09; median 100; max 100
>
> s=0.45:
> -> <stat> Ham scores for all runs: 16000 items; mean 0.60; sdev 5.00
> -> <stat> min -1.44329e-13; median 2.66842e-11; max 100
> --
> -> <stat> Spam scores for all runs: 5800 items; mean 99.04; sdev 5.74
> -> <stat> min 7.69111e-09; median 100; max 100
>
> s=0.75:
> -> <stat> Ham scores for all runs: 16000 items; mean 0.73; sdev 5.43
> -> <stat> min -1.11022e-13; median 9.83325e-11; max 100
> --
> -> <stat> Spam scores for all runs: 5800 items; mean 98.95; sdev 5.68
> -> <stat> min 3.83111e-05; median 100; max 100

That all makes sense, right?  The lower s, the more extreme spamprobs get,
and the higher s the less extreme.  So from top to bottom, ham means and
medians increase, spam means and medians decrease (well, the latter is
invisible for spam at this level of precision:  at least half your spam
scores are 100, to 6 significant digits, under all variations), and the ham
sdevs increase (the spam sdevs actually shrink a bit).

> And:
>
> s=0.25:
> -> best cost for all runs: $109.60
> -> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
> -> achieved at 2 cutoff pairs
> -> smallest ham & spam cutoffs 0.48 & 0.93
> ->     fp 6; fn 13; unsure ham 43; unsure spam 140
> ->     fp rate 0.0375%; fn rate 0.224%; unsure rate 0.839%
> -> largest ham & spam cutoffs 0.49 & 0.93
> ->     fp 6; fn 14; unsure ham 39; unsure spam 139
> ->     fp rate 0.0375%; fn rate 0.241%; unsure rate 0.817%
>
> s=0.45:
> -> best cost for all runs: $112.40
> -> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
> -> achieved at 2 cutoff pairs
> -> smallest ham & spam cutoffs 0.495 & 0.975
> ->     fp 3; fn 15; unsure ham 42; unsure spam 295
> ->     fp rate 0.0187%; fn rate 0.259%; unsure rate 1.55%
> -> largest ham & spam cutoffs 0.5 & 0.975
> ->     fp 3; fn 16; unsure ham 38; unsure spam 294
> ->     fp rate 0.0187%; fn rate 0.276%; unsure rate 1.52%
>
> s=0.75:
> -> best cost for all runs: $108.20
> -> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
> -> achieved at ham & spam cutoffs 0.505 & 0.95
> ->     fp 4; fn 13; unsure ham 46; unsure spam 230
> ->     fp rate 0.025%; fn rate 0.224%; unsure rate 1.27%
>
> Don't know what to think about this. Total cost looks fairly insensitive
> here, but the distribution over the types of cost is different.

The most interesting thing there may be a coincidence <0.3 wink>:  the
default s=0.45 was obtained from staring at all the reports that came in
during the Graham-vs-Robinson death match (tuning s for your data was part
of the task there, although it was called "a" at the time), then picking a
default value that appeared to get close to minimizing the fp rate across
testers.  And s=0.45 minimized the fp rate in your results above.

With an absolute # of fp this low, though, I'm afraid that just one specific
oddball ham can easily warp the conclusions in its favor.  If I try to
mentally discount that, I think the data above mostly suggests that a
higher-than-default value for s is better for some combination of

    your test data
    this combining scheme (which did you use?  chi-combining?)
    this value of robinson_minimum_prob_strength (ditto)
    this value of max_discriminators (ditto)

It's not obvious how much training data you used here either, but do note
that s=0.45 was picked from 10-fold cv runs with 200 ham and 200 spam in
each set (that exact setup was a requirement for participating in the death
match).  You appear to be using about 10x more ham and something like 2.5x
more spam than that, and I think it stands to reason that low s is
potentially more helpful the less training data you have (no matter what the
value of s, spamprobs *eventually* approach the raw estimates obtained from
counting -- if you have a lot of data, the really strong clues remain really
strong clues throughout this range of s values).
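
To make that "eventually" concrete:  in the smoothing formula
(s*x + n*p)/(s + n), with p the raw counting estimate and x the prior, the
s terms are constants while n grows, so the adjusted spamprob converges to
p no matter what s is.  A quick illustration (sketch arithmetic only):

    p, x = 0.99, 0.5     # a strong spam clue's raw estimate, and the prior
    for n in (1, 5, 50, 500):
        row = ["%.3f" % ((s * x + n * p) / (s + n))
               for s in (0.25, 0.45, 0.75)]
        print("n=%3d  %s" % (n, "  ".join(row)))
    # by n=500 all three values of s give essentially the same extreme prob

With lots of training data, the strong clues sail past the region where s
matters; with little data, they're stuck in it.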


BTW, I was reading a paper on boosting, and one observation struck home:
boosting combines many rules in a weighted-average way, where the weights
are adjusted iteratively, between passes boosting the "importance" of the
examples the previous iteration misclassified.  What the author found was
that boosting worked better overall if he fiddled it to eventually *stop*
paying attention to examples that were persistently and badly misclassified.
In effect, trying ever harder to fit the outliers warped the whole scheme in
their direction in ever more extreme ways, but almost by definition the
outliers didn't fit the scheme at all.
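
For anyone who hasn't seen boosting, the reweighting loop being described is
roughly the following -- a generic AdaBoost-flavored sketch, not the paper's
exact algorithm and nothing to do with our scoring code.  weak_learner is a
stand-in for whatever rule-fitter is being boosted, labels are +1/-1, and
capping the weights is one crude way to express the "stop obsessing over
the outliers" idea:

    import math

    def boost(examples, labels, weak_learner, rounds=10, max_weight=None):
        """Generic boosting sketch:  each round fits a rule to the weighted
        examples, then bumps the weights of the examples that rule got
        wrong so the next round concentrates on them."""
        n = len(examples)
        weights = [1.0 / n] * n
        rules = []
        for _ in range(rounds):
            rule = weak_learner(examples, labels, weights)
            err = sum(w for w, ex, y in zip(weights, examples, labels)
                      if rule(ex) != y)
            err = min(max(err, 1e-9), 1.0 - 1e-9)
            alpha = 0.5 * math.log((1.0 - err) / err)
            rules.append((alpha, rule))
            # misclassified examples get heavier, the rest get lighter
            weights = [w * math.exp(alpha if rule(ex) != y else -alpha)
                       for w, ex, y in zip(weights, examples, labels)]
            if max_weight is not None:     # don't chase persistent outliers
                weights = [min(w, max_weight) for w in weights]
            total = sum(weights)
            weights = [w / total for w in weights]
        # final classifier:  a weighted vote over all the rules
        def classify(ex):
            return 1 if sum(a * rule(ex) for a, rule in rules) > 0 else -1
        return classify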

Similarly, I believe that some of our persistent fp and fn under this scheme
are simply never going to go away, and endless fiddling of parameters to try
to make them go away will hurt overall performance in a doomed attempt to
redeem them.  The combining schemes we've got now are excellent by any
measure, and I suspect it's time to leave them alone.