[Spambayes] new option: generate_long_skips
Tim Peters
tim.one@comcast.net
Mon, 30 Sep 2002 18:20:25 -0400
[Skip Montanaro]
> I just checked in a new option for the tokenizer:
> generate_long_skips. The default is True. I noticed when reviewing
> my false positives that one was overwhelmingly dominated by these
> tokens (which scored very high) because it contained an Excel
> spreadsheet attachment.
We ignore MIME sections that don't have a text/* content type. Was this a
spreadsheet that identified itself as having a text content type?
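
A minimal sketch of the filtering described above, written against Python's
standard email package rather than the project's own tokenizer code. The
helper names text_parts and kept_types and the sample message are
hypothetical, made up for illustration. It suggests why the declared
content type matters: a spreadsheet labeled application/vnd.ms-excel is
skipped, while the same bytes labeled text/plain would get through.

```python
from email import message_from_string

def text_parts(msg):
    """Yield leaf MIME parts whose main type is text (hypothetical helper)."""
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers carry no body of their own
        if part.get_content_maintype() == "text":
            yield part

# Made-up two-part message; only the attachment's declared type varies.
TEMPLATE = """\
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="XYZ"

--XYZ
Content-Type: text/plain

Quarterly numbers attached.
--XYZ
Content-Type: {attachment_type}
Content-Transfer-Encoding: base64

0M8R4KGxGuEAAAAA
--XYZ--
"""

def kept_types(attachment_type):
    """Return the content types of the parts that survive the text filter."""
    msg = message_from_string(TEMPLATE.format(attachment_type=attachment_type))
    return [part.get_content_type() for part in text_parts(msg)]

correct = kept_types("application/vnd.ms-excel")  # attachment is dropped
mislabeled = kept_types("text/plain")             # attachment slips through
print(correct, mislabeled)
```

With a correctly labeled attachment only the body part is kept; with the
mislabeled one both parts are kept, so the base64 payload would reach the
tokenizer.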
> false positive percentages
> 1.500 1.000 won -33.33%
> 4.000 4.000 tied
> 3.000 2.500 won -16.67%
> 1.500 1.500 tied
> 1.000 1.000 tied
>
> won 2 times
> tied 3 times
> lost 0 times
Pure but small win.
> total unique fp went from 22 to 20 won -9.09%
> mean fp % went from 2.2 to 2.0 won -9.09%
>
> false negative percentages
> 2.000 2.500 lost +25.00%
> 1.000 1.000 tied
> 0.500 0.500 tied
> 1.500 1.500 tied
> 2.000 2.500 lost +25.00%
>
> won 0 times
> tied 3 times
> lost 2 times
Pure but small loss.
> total unique fn went from 14 to 16 lost +14.29%
> mean fn % went from 1.4 to 1.6 lost +14.29%
>
>        ham mean                     ham sdev
>  22.12   21.85   -1.22%        6.01    5.74   -4.49%
>  23.46   23.25   -0.90%        7.31    6.93   -5.20%
>  23.50   23.38   -0.51%        6.64    6.51   -1.96%
>  23.54   23.32   -0.93%        6.88    6.87   -0.15%
>  23.08   22.79   -1.26%        6.77    6.62   -2.22%
>
> ham mean and sdev for all runs
> 23.14 22.92 -0.95% 6.76 6.57 -2.81%
>
>        spam mean                    spam sdev
>  72.49   71.95   -0.74%       13.82   14.02   +1.45%
>  71.34   70.61   -1.02%       13.70   13.45   -1.82%
>  73.12   72.58   -0.74%       12.88   12.80   -0.62%
>  72.40   72.01   -0.54%       12.71   12.65   -0.47%
>  70.71   70.10   -0.86%       13.91   13.74   -1.22%
>
> spam mean and sdev for all runs
> 72.01 71.45 -0.78% 13.44 13.37 -0.52%
>
> ham/spam mean difference: 48.87 48.53 -0.34
Mixed bag, but overall brought your ham and spam a little closer together.
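
For anyone checking the arithmetic, here is a small sketch that recomputes
the summary figures quoted above from the reported totals and means. The
function name pct_change is an assumption made for this example; it is not
taken from the comparison script that produced the tables.

```python
def pct_change(before, after):
    """Relative change from before to after, in percent, to two decimals."""
    return round(100.0 * (after - before) / before, 2)

# Totals quoted in the report above.
fp_change = pct_change(22, 20)   # reported as -9.09%
fn_change = pct_change(14, 16)   # reported as +14.29%

# Separation between the overall spam mean and the overall ham mean.
sep_before = round(72.01 - 23.14, 2)          # reported as 48.87
sep_after = round(71.45 - 22.92, 2)           # reported as 48.53
sep_delta = round(sep_after - sep_before, 2)  # reported as -0.34

print(fp_change, fn_change, sep_before, sep_after, sep_delta)
```

The negative sep_delta is the "a little closer together" effect: the spam
mean dropped by more than the ham mean did.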
> I think it might be helpful for people whose ham tends to get the
> occasional legitimate binary attachment.
The code intends to ignore those already; perhaps the MIME in the example
you looked at was incorrect, or perhaps tokenizer.textparts() is buggy?
> ...
> Note the very low spam_cutoff. [.40]
I did <wink>. Your corpus remains unique this way!