[spambayes-dev] Very small change for composite word tokenizing.

Meyer, Tony T.A.Meyer at massey.ac.nz
Thu Aug 7 12:39:14 EDT 2003


> Seems like most people are seeing this change as a loss or at best no 
> gain.  I wonder if it would make a difference in the accuracy if we 
> returned special compound word tokens instead of returning the 
> components as normal words?  Something like:
> 
>      yield 'compound:' + word
> 
> Anyone want to give this variation a try?

(I changed "yield w" to "yield 'compound:' + w")
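
For anyone who wants to try the same variation, here is a rough sketch of
the idea. It is not the actual tokenizer code: the function name
compound_tokens, the tag_compounds flag and the rule for splitting a
composite word (on '-', '_' or '.') are all assumptions made for
illustration. Only the 'compound:' prefix comes from the suggestion above.

```python
import re

# Hypothetical splitting rule: break a composite word on '-', '_' or '.'.
# The real tokenizer change may split composite words differently.
_COMPONENT_SPLIT = re.compile(r"[-_.]")


def compound_tokens(word, tag_compounds=True):
    """Yield the components of a composite word.

    With tag_compounds=True each component is emitted as a special
    'compound:' token; with False it is emitted as a normal word.
    Words with no separator yield nothing.
    """
    parts = [p for p in _COMPONENT_SPLIT.split(word) if p]
    if len(parts) < 2:
        return
    for w in parts:
        if tag_compounds:
            yield 'compound:' + w
        else:
            yield w
```

Called on "free-money", the tagged form gives 'compound:free' and
'compound:money', so the classifier keeps separate statistics for a word
seen on its own and the same word seen inside a composite.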

filename:   august_no_seans  august_seans      kennys
ham:spam:        7900:15260    7900:15260  7900:15260
fp total:                 2             2           2
fp %:                  0.03          0.03        0.03
fn total:               176           172         174
fn %:                  1.15          1.13        1.14
unsure t:               501           499         491
unsure %:              2.16          2.15        2.12
real cost:          $296.20       $291.80     $292.20
best cost:          $489.60       $488.80     $485.00
h mean:                0.63          0.62        0.61
h sdev:                4.84          4.81        4.80
s mean:               94.52         94.57       94.56
s sdev:               18.67         18.56       18.58
mean diff:            93.89         93.95       93.95
k:                     3.99          4.02        4.02

Interesting.  FNs are better than not doing anything with the compound
words (174 vs. 176), but not as good as with just the word (172).
Unsures, however, are even better (491, against 499 and 501).  I might
try this on a different corpus and see how it goes there.
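
As a sanity check on the table, the percentage and "real cost" rows can be
recomputed from the raw counts. The sketch below assumes cost weights of
$10 per fp, $1 per fn and $0.20 per unsure; those weights are not stated
in this message, but they reproduce all three "real cost" figures. The
helper names are made up for the example.

```python
# Assumed cost weights (not stated in this message, but consistent
# with the "real cost" row of the table).
FP_COST = 10.00
FN_COST = 1.00
UNSURE_COST = 0.20

N_HAM = 7900
N_SPAM = 15260


def pct(count, total):
    """Percentage rounded to two places, as in the table."""
    return round(100.0 * count / total, 2)


def real_cost(fp, fn, unsure):
    """Weighted cost of the errors at the cutoffs actually used."""
    return round(fp * FP_COST + fn * FN_COST + unsure * UNSURE_COST, 2)


# (fp total, fn total, unsure total) for each run, copied from the table.
runs = {
    "august_no_seans": (2, 176, 501),
    "august_seans": (2, 172, 499),
    "kennys": (2, 174, 491),
}

for name, (fp, fn, unsure) in runs.items():
    print("%-16s fp%%=%.2f fn%%=%.2f unsure%%=%.2f cost=$%.2f" % (
        name,
        pct(fp, N_HAM),                # fp rate is over ham only
        pct(fn, N_SPAM),               # fn rate is over spam only
        pct(unsure, N_HAM + N_SPAM),   # unsure rate is over all messages
        real_cost(fp, fn, unsure),
    ))
```

Since the fp count is the same in all three runs, the cost differences
come entirely from the fn and unsure counts.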

=Tony Meyer


