[Spambayes] options.skip_max_word_size.

Tim Peters <tim.one@comcast.net>
Mon Oct 28 07:32:11 2002


[Anthony Baxter]
> I noticed a bunch of really nice ham clues were getting skipped in some
> of my personal email's 'unsure' bucket.  They were words like
> 'interconnection' and other longer techie-words. I added an option
> skip_max_word_size and tried boosting it to 20 (from the default of 12).
>
> cmp.py shows this (skip_max_word_size 12 on left, 20 on right)

...
> total unique fp went from 0 to 0 tied
> mean fp % went from 0.0 to 0.0 tied
>
> false negative percentages
>     0.000  0.277  lost  +(was 0)
>     0.559  0.838  lost   +49.91%
>     0.836  1.114  lost   +33.25%
>     0.279  0.836  lost  +199.64%
>
> won   0 times
> tied  0 times
> lost  4 times
>
> total unique fn went from 6 to 11 lost   +83.33%
> mean fn % went from 0.418410855729 to 0.766214465324 lost   +83.12%

> ...
> Unfortunately, cmp.py skips the important bit. My 'unsure' numbers
> went from 164 to 135!

Under the default costs, this would be judged close to a wash:  5 new fn @
$1 was a loss of $5, while 29 fewer unsure @ $.20 was a gain of $5.80.
table.py would show this more clearly, and the histogram analysis (which
table.py summarizes) would tell us whether you could have gotten just as
good an improvement by changing your ham_cutoff and spam_cutoff values (it's
impossible to guess that from what you posted).
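
To make the arithmetic concrete, here's a minimal sketch of that cost
accounting (assuming the default weights of $1 per fn and $0.20 per unsure
mentioned above; fp stayed at 0 in both runs, so it drops out):

    # Sketch only -- not the actual Spambayes code.
    FN_WEIGHT = 1.00       # cost of a false negative
    UNSURE_WEIGHT = 0.20   # cost of a message left in the unsure bucket

    def total_cost(n_fn, n_unsure):
        return n_fn * FN_WEIGHT + n_unsure * UNSURE_WEIGHT

    before = total_cost(n_fn=6,  n_unsure=164)  # skip_max_word_size = 12
    after  = total_cost(n_fn=11, n_unsure=135)  # skip_max_word_size = 20
    print(before, after)  # 38.8 vs 38.0 -- close to a wash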

> I'm not sure if this is just something that's an artifact of my
> own data, or more general - if others could try it as well, it
> would be good.

It's something I haven't tried under chi-combining yet, so I will, but not
right now.  In previous tests, boosting skip_max_word_size to 13 didn't have
a significant effect on error rates, but it did boost the database size.
That was before we had a usable notion of middle ground, though, so I've no
idea what effect the change would have had on the unsure rate in those older
tests.
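
For anyone who hasn't stared at the tokenizer, the effect of
skip_max_word_size is roughly this (a simplified sketch, not the actual
tokenizer code -- in particular, treat the skip-token format as
illustrative):

    def tokenize_word(word, skip_max_word_size=12):
        # Illustrative sketch of the word-size check, not the real code.
        if len(word) <= skip_max_word_size:
            # Short enough: the word itself becomes a clue.
            yield word
        else:
            # Too long: collapse it into a synthetic "skip" token that
            # records only the first character and a rough length, so a
            # word like 'interconnection' never shows up as its own clue.
            yield "skip:%c %d" % (word[0], len(word) // 10 * 10)

Raising the limit means more long words survive as distinct tokens instead
of being collapsed, which is why boosting it grows the database; whether
those extra clues actually help the unsure rate is exactly what Anthony's
test is asking.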