[spambayes-dev] Very small change for composite word tokenizing.

T. Alexander Popiel popiel at wolfskeep.com
Thu Aug 7 10:32:20 EDT 2003


In message:  <1ED4ECF91CDED24C8D012BCF2B034F1302A9F632 at its-xchg4.massey.ac.nz>
             "Meyer, Tony" <T.A.Meyer at massey.ac.nz> writes:
>
>(I changed "yield w" to "yield 'compound:' + w")
>
>filename:  august_no_seans    august_seans          kennys
>ham:spam:       7900:15260      7900:15260      7900:15260
>fp total:                2               2               2
>fp %:                 0.03            0.03            0.03
>fn total:              176             172             174
>fn %:                 1.15            1.13            1.14
>unsure t:              501             499             491
>unsure %:             2.16            2.15            2.12
>real cost:         $296.20         $291.80         $292.20
>best cost:         $489.60         $488.80         $485.00
>h mean:               0.63            0.62            0.61
>h sdev:               4.84            4.81            4.80
>s mean:              94.52           94.57           94.56
>s sdev:              18.67           18.56           18.58
>mean diff:           93.89           93.95           93.95
>k:                    3.99            4.02            4.02
>
>Interesting.  FN's are better than not doing anything with the compound
>words, but not as good as with just the word.  Unsures, however, are
>even better.  I might try this on a different corpus and see how it goes
>there.

Here are my results:

filename:     normal  fragment  compound
ham:spam:  1978:6166 1978:6166 1978:6166
fp total:          1         1         1
fp %:           0.05      0.05      0.05
fn total:         25        28        25
fn %:           0.41      0.45      0.41
unsure t:        152       172       154
unsure %:       1.87      2.11      1.89
real cost:    $65.40    $72.40    $65.80
best cost:    $41.80    $44.20    $41.40
h mean:         0.27      0.25      0.26
h sdev:         3.80      3.71      3.76
s mean:        98.66     98.51     98.65
s sdev:         8.56      8.97      8.51
mean diff:     98.39     98.26     98.39
k:              7.96      7.75      8.02
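
For anyone checking the arithmetic, the percentage and "real cost"
rows can be reproduced from the raw counts. This is only a sketch: the
weights ($10 per fp, $1 per fn, $0.20 per unsure) are an assumption
inferred from the figures in this thread, not read out of anyone's
configuration, and the function names are made up for illustration.

```python
# Weights inferred from the table rows, not read from any config file.
FP_COST, FN_COST, UNSURE_COST = 10.00, 1.00, 0.20

def real_cost(fp, fn, unsure):
    """Total dollar cost of one run under the assumed weights."""
    return fp * FP_COST + fn * FN_COST + unsure * UNSURE_COST

def rates(ham, spam, fp, fn, unsure):
    """Return (fp %, fn %, unsure %): fp over ham, fn over spam,
    unsure over all messages."""
    return (100.0 * fp / ham,
            100.0 * fn / spam,
            100.0 * unsure / (ham + spam))

# (fp, fn, unsure) counts for the three runs on 1978 ham / 6166 spam.
runs = {
    "normal":   (1, 25, 152),
    "fragment": (1, 28, 172),
    "compound": (1, 25, 154),
}

for name, (fp, fn, unsure) in runs.items():
    pcts = rates(1978, 6166, fp, fn, unsure)
    print("%-9s $%.2f  %.2f %.2f %.2f"
          % ((name, real_cost(fp, fn, unsure)) + pcts))
```

Under those weights the three runs differ only through fn and unsure,
since every run has the same single fp.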

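With the 'compound:' prefix on the generated tokens, the
fragmentation code is neutral for me again: the compound run
matches normal on fp and fn (1 and 25) and is within two
unsures (154 vs. 152), while the plain fragment run is worse
(28 fn, 172 unsures).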
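
To make the comparison concrete, here is roughly the shape of the
change being tested. It is a minimal sketch, not the actual tokenizer
patch: the function name tokenize_word, the splitting regex, and the
minimum fragment length are all hypothetical. Only the difference
between "yield w" and "yield 'compound:' + w" comes from Tony's
message.

```python
import re

# Hypothetical splitter: break a word on runs of non-alphanumerics.
_SPLIT = re.compile(r"[^A-Za-z0-9]+")

def tokenize_word(word, prefix="compound:", min_len=3):
    """Yield the word itself, then its fragments if it is a compound.

    prefix="compound:" tags each fragment; prefix="" mimics the
    plain "yield w" variant.
    """
    yield word
    pieces = [p for p in _SPLIT.split(word) if len(p) >= min_len]
    if len(pieces) < 2:
        return  # not a compound word, so nothing extra to generate
    for w in pieces:
        yield prefix + w

print(list(tokenize_word("free-money")))
print(list(tokenize_word("free-money", prefix="")))
```

The point of the prefix is that a fragment such as "money" gets its
own statistics as 'compound:money', separate from the same word seen
on its own, so the two kinds of evidence are not pooled.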

- Alex


