Mon Oct 28 07:32:11 2002
> I noticed a bunch of really nice ham clues were getting skipped in some
> of my personal email's 'unsure' bucket. They were words like
> 'interconnection' and other longer techie-words. I added an option
> skip_max_word_size and tried boosting it to 20 (from the default of 12).
> cmp.py shows this (skip_max_word_size 12 on left, 20 on right)
> total unique fp went from 0 to 0 tied
> mean fp % went from 0.0 to 0.0 tied
> false negative percentages
> 0.000 0.277 lost +(was 0)
> 0.559 0.838 lost +49.91%
> 0.836 1.114 lost +33.25%
> 0.279 0.836 lost +199.64%
> won 0 times
> tied 0 times
> lost 4 times
> total unique fn went from 6 to 11 lost +83.33%
> mean fn % went from 0.418410855729 to 0.766214465324 lost +83.12%
> Unfortunately, cmp.py skips the important bit. My 'unsure' numbers
> went from 164 to 135!
Under the default costs, this would be judged close to a wash: 5 new fn @
$1 is a loss of $5.00, while 29 fewer unsures @ $0.20 is a gain of $5.80.
table.py would show this more clearly, and the histogram analysis (which
table.py summarizes) would tell us whether you could have gotten just as
good an improvement by changing your ham_cutoff and spam_cutoff values (it's
impossible to guess that from what you posted).
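The dollar accounting above can be sketched in a few lines. This is just an illustration of the arithmetic, assuming the usual default test costs ($10 per false positive, $1 per false negative, $0.20 per unsure); the function name is made up here, not something from cmp.py or table.py:

```python
# Assumed default costs (dollars); fp is much more expensive than fn.
FP_COST = 10.00
FN_COST = 1.00
UNSURE_COST = 0.20

def net_cost_change(fp_delta, fn_delta, unsure_delta):
    """Total change in dollar cost between two runs.

    Positive means the change made things more expensive;
    negative means it was a net win.
    """
    return (fp_delta * FP_COST +
            fn_delta * FN_COST +
            unsure_delta * UNSURE_COST)

# From the run quoted above: fp 0 -> 0, fn 6 -> 11, unsure 164 -> 135.
delta = net_cost_change(fp_delta=0, fn_delta=11 - 6, unsure_delta=135 - 164)
print("net change: $%.2f" % delta)
```

That gives 5 * $1.00 - 29 * $0.20 = -$0.80, i.e. a very small net win, which is why it's judged close to a wash.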
> I'm not sure if this is just something that's an artifact of my
> own data, or more general - if others could try it as well, it
> would be good.
It's something I haven't tried under chi-combining yet, so I will, but not
right now. In previous tests, boosting it to 13 didn't have a significant
effect on error rates, but did boost the database size. This was before we had a
usable notion of middle ground, though, so I've no idea what effect those
older tests may have had on the unsure rate.