[Spambayes] new option: generate_long_skips

Tim Peters tim.one@comcast.net
Mon, 30 Sep 2002 18:20:25 -0400


[Skip Montanaro]
> I just checked in a new option for the tokenizer:
> generate_long_skips.  The default is True.  I noticed when reviewing
> my false positives that one was overwhelmingly dominated by these
> tokens (which scored very high) because it contained an Excel
> spreadsheet attachment.

We ignore MIME sections that don't have a text/* type.  Did this spreadsheet
identify itself as having a text content-type?
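
To make the question concrete, here is a minimal sketch of that kind of filter, written against the standard library's email package.  It is not the Spambayes tokenizer code; the helper name text_parts and both sample messages are made up for illustration.  It shows that a correctly labelled spreadsheet is dropped, while one labelled text/plain would pass straight through.

```python
from email.message import EmailMessage


def text_parts(msg):
    """Return the MIME parts whose declared content type is text/*."""
    # Only the declared type is checked; the payload is never inspected.
    return [part for part in msg.walk()
            if part.get_content_maintype() == "text"]


# Hypothetical message: plain-text body plus a correctly labelled spreadsheet.
labelled = EmailMessage()
labelled.set_content("Quarterly numbers attached.")
labelled.add_attachment(b"\xd0\xcf\x11\xe0 stand-in spreadsheet bytes",
                        maintype="application", subtype="vnd.ms-excel",
                        filename="numbers.xls")

# Hypothetical message: the attachment claims to be text/plain.
mislabelled = EmailMessage()
mislabelled.set_content("Quarterly numbers attached.")
mislabelled.add_attachment("0M8R4KGxGuEAAAAAAAAAAAAAAAAAAAAA",
                           subtype="plain", filename="numbers.xls")

kept_labelled = [p.get_content_type() for p in text_parts(labelled)]
kept_mislabelled = [p.get_filename() for p in text_parts(mislabelled)]
```

In the second message the attachment survives the filter, so its contents would reach the tokenizer like any other text.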

>     false positive percentages
>         1.500  1.000  won    -33.33%
>         4.000  4.000  tied
>         3.000  2.500  won    -16.67%
>         1.500  1.500  tied
>         1.000  1.000  tied
>
>     won   2 times
>     tied  3 times
>     lost  0 times

Pure but small win.

>     total unique fp went from 22 to 20 won     -9.09%
>     mean fp % went from 2.2 to 2.0 won     -9.09%
>
>     false negative percentages
>         2.000  2.500  lost   +25.00%
>         1.000  1.000  tied
>         0.500  0.500  tied
>         1.500  1.500  tied
>         2.000  2.500  lost   +25.00%
>
>     won   0 times
>     tied  3 times
>     lost  2 times

Pure but small loss.
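
For anyone reading the report format for the first time, the won/tied/lost labels and percentages can be reproduced with a few lines.  This is a sketch of the arithmetic only, not the script that produced the report; the function name compare is hypothetical, and the run data is copied from the quoted tables.

```python
def compare(before, after):
    """Label one run; a lower error rate after the change is a win."""
    if after == before:
        return "tied", 0.0
    tag = "won" if after < before else "lost"
    # Percentage change relative to the "before" rate.
    return tag, round((after - before) / before * 100, 2)


# (before, after) per run, copied from the quoted report.
fp_runs = [(1.5, 1.0), (4.0, 4.0), (3.0, 2.5), (1.5, 1.5), (1.0, 1.0)]
fn_runs = [(2.0, 2.5), (1.0, 1.0), (0.5, 0.5), (1.5, 1.5), (2.0, 2.5)]

fp_results = [compare(b, a) for b, a in fp_runs]
fn_results = [compare(b, a) for b, a in fn_runs]

for name, results in (("fp", fp_results), ("fn", fn_results)):
    tags = [tag for tag, _ in results]
    print(name, "won", tags.count("won"),
          "tied", tags.count("tied"), "lost", tags.count("lost"))
```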

>     total unique fn went from 14 to 16 lost   +14.29%
>     mean fn % went from 1.4 to 1.6 lost   +14.29%
>
>     ham mean                     ham sdev
>       22.12   21.85   -1.22%        6.01    5.74   -4.49%
>       23.46   23.25   -0.90%        7.31    6.93   -5.20%
>       23.50   23.38   -0.51%        6.64    6.51   -1.96%
>       23.54   23.32   -0.93%        6.88    6.87   -0.15%
>       23.08   22.79   -1.26%        6.77    6.62   -2.22%
>
>     ham mean and sdev for all runs
>       23.14   22.92   -0.95%        6.76    6.57   -2.81%
>
>     spam mean                    spam sdev
>       72.49   71.95   -0.74%       13.82   14.02   +1.45%
>       71.34   70.61   -1.02%       13.70   13.45   -1.82%
>       73.12   72.58   -0.74%       12.88   12.80   -0.62%
>       72.40   72.01   -0.54%       12.71   12.65   -0.47%
>       70.71   70.10   -0.86%       13.91   13.74   -1.22%
>
>     spam mean and sdev for all runs
>       72.01   71.45   -0.78%       13.44   13.37   -0.52%
>
>     ham/spam mean difference: 48.87 48.53 -0.34

Mixed bag, but overall it brought your ham and spam means a little closer
together.
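
The "mean difference" line is just the gap between the all-runs spam mean and the all-runs ham mean, before and after the change.  A small sketch of that arithmetic, using the figures from the quoted report (the variable names are mine):

```python
# (before, after) all-runs means, copied from the quoted report.
ham_mean = (23.14, 22.92)
spam_mean = (72.01, 71.45)

# Separation between the two populations' mean scores.
gap_before = round(spam_mean[0] - ham_mean[0], 2)
gap_after = round(spam_mean[1] - ham_mean[1], 2)

# A negative change means the populations moved closer together.
change = round(gap_after - gap_before, 2)
print(gap_before, gap_after, change)
```

Both means dropped, but the spam mean dropped by more, which is why the gap shrank.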

> I think it might be helpful for people whose ham tends to get the
> occasional legitimate binary attachment.

The code intends to ignore those already; perhaps the MIME in the example
you looked at was incorrect, or perhaps tokenizer.textparts() is buggy?

> ...
> Note the very low spam_cutoff. [.40]

I did <wink>.  Your corpus remains unique this way!