[spambayes-dev] Very small change for composite word tokenizing.
Meyer, Tony
T.A.Meyer at massey.ac.nz
Thu Aug 7 12:39:14 EDT 2003
> Seems like most people are seeing this change as a loss or at best no
> gain. I wonder if it would make a difference in the accuracy if we
> returned special compound word tokens instead of returning the
> components as normal words? Something like:
>
> yield 'compound:' + word
>
> Anyone want to give this variation a try?
(I changed "yield w" to "yield 'compound:' + w")
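
For anyone who wants to try the same variation, here is a minimal sketch of the two behaviours being compared. The function name, the tag_components flag and the splitting rule (break on runs of non-alphanumeric characters) are all assumptions made for illustration; this is not the actual SpamBayes tokenizer code.

```python
import re

# Hypothetical splitting rule: break a word on runs of non-alphanumeric
# characters. The real tokenizer may split composite words differently.
_SEPARATORS = re.compile(r"[^A-Za-z0-9]+")


def compound_tokens(word, tag_components=True):
    """Yield one token per component of a composite word.

    With tag_components=True each component is yielded as
    'compound:' + w (the variation tested here); with False the
    components are yielded as normal words.
    """
    parts = [p for p in _SEPARATORS.split(word) if p]
    if len(parts) < 2:
        return  # not a composite word, so nothing extra to yield
    for w in parts:
        if tag_components:
            yield 'compound:' + w
        else:
            yield w


if __name__ == "__main__":
    print(list(compound_tokens("free-money")))
    print(list(compound_tokens("free-money", tag_components=False)))
```

Tagging keeps the component statistics separate from those of the same word seen on its own, so the classifier can score "money" inside a composite differently from a bare "money".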
filename:   august_no_seans      kennys  august_seans
ham:spam:        7900:15260  7900:15260    7900:15260
fp total:                 2           2             2
fp %:                  0.03        0.03          0.03
fn total:               176         172           174
fn %:                  1.15        1.13          1.14
unsure t:               501         499           491
unsure %:              2.16        2.15          2.12
real cost:          $296.20     $291.80       $292.20
best cost:          $489.60     $488.80       $485.00
h mean:                0.63        0.62          0.61
h sdev:                4.84        4.81          4.80
s mean:               94.52       94.57         94.56
s sdev:               18.67       18.56         18.58
mean diff:            93.89       93.95         93.95
k:                     3.99        4.02          4.02
Interesting. FNs are better than not doing anything with the compound
words (174 vs. 176), but not as good as with just the word (172).
Unsures, however, are even better (491, against 499 and 501). I might
try this on a different corpus and see how it goes there.
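
As a cross-check on the table, the percentage and "real cost" rows can be recomputed from the raw counts. The sketch below assumes the cost weights are $10 per false positive, $1 per false negative and $0.20 per unsure; those weights are not stated in this message, but they do reproduce the reported figures. The helper names are made up for the example.

```python
HAM, SPAM = 7900, 15260  # message counts from the ham:spam row


def pct(count, total):
    """Percentage of total, rounded to two places as in the table."""
    return round(100.0 * count / total, 2)


def real_cost(fp, fn, unsure, fp_weight=10.0, fn_weight=1.0, unsure_weight=0.2):
    """Weighted cost of a run; the default weights are assumptions."""
    return round(fp * fp_weight + fn * fn_weight + unsure * unsure_weight, 2)


# (fp total, fn total, unsure total) for each column of the table
runs = {
    "august_no_seans": (2, 176, 501),
    "kennys": (2, 172, 499),
    "august_seans": (2, 174, 491),
}

for name, (fp, fn, unsure) in runs.items():
    print("%-16s fp%%=%.2f fn%%=%.2f unsure%%=%.2f cost=$%.2f" % (
        name,
        pct(fp, HAM),             # false positives are measured against ham
        pct(fn, SPAM),            # false negatives against spam
        pct(unsure, HAM + SPAM),  # unsures against all messages
        real_cost(fp, fn, unsure),
    ))
```

Since the fp totals are identical in all three runs, the cost differences come entirely from the fn and unsure counts.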
=Tony Meyer