[spambayes-dev] Very small change for composite word tokenizing.
T. Alexander Popiel
popiel at wolfskeep.com
Thu Aug 7 10:32:20 EDT 2003
In message: <1ED4ECF91CDED24C8D012BCF2B034F1302A9F632 at its-xchg4.massey.ac.nz>
"Meyer, Tony" <T.A.Meyer at massey.ac.nz> writes:
>
>(I changed "yield w" to "yield 'compound:' + w")
>
>filename:   august_no_seans          kennys    august_seans
>ham:spam:        7900:15260      7900:15260      7900:15260
>fp total:                 2               2               2
>fp %:                  0.03            0.03            0.03
>fn total:               176             172             174
>fn %:                  1.15            1.13            1.14
>unsure t:               501             499             491
>unsure %:              2.16            2.15            2.12
>real cost:          $296.20         $291.80         $292.20
>best cost:          $489.60         $488.80         $485.00
>h mean:                0.63            0.62            0.61
>h sdev:                4.84            4.81            4.80
>s mean:               94.52           94.57           94.56
>s sdev:               18.67           18.56           18.58
>mean diff:            93.89           93.95           93.95
>k:                     3.99            4.02            4.02
>
>Interesting. FN's are better than not doing anything with the compound
>words, but not as good as with just the word. Unsures, however, are
>even better. I might try this on a different corpus and see how it goes
>there.
Here are my results:
filename:        normal    fragment    compound
ham:spam:     1978:6166   1978:6166   1978:6166
fp total:             1           1           1
fp %:              0.05        0.05        0.05
fn total:            25          28          25
fn %:              0.41        0.45        0.41
unsure t:           152         172         154
unsure %:          1.87        2.11        1.89
real cost:       $65.40      $72.40      $65.80
best cost:       $41.80      $44.20      $41.40
h mean:            0.27        0.25        0.26
h sdev:            3.80        3.71        3.76
s mean:           98.66       98.51       98.65
s sdev:            8.56        8.97        8.51
mean diff:        98.39       98.26       98.39
k:                 7.96        7.75        8.02
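For anyone checking the arithmetic: the "real cost" row can be reproduced from the fp, fn, and unsure counts if you assume weights of $10 per false positive, $1 per false negative, and $0.20 per unsure. Those weights are an assumption on my part (they are not stated in this thread), but they do match every "real cost" figure in both tables. A minimal sketch, with a hypothetical helper name:

```python
# Assumed weights (not stated in the thread); they reproduce every
# "real cost" figure in the tables above.
FP_WEIGHT = 10.00      # dollars per false positive
FN_WEIGHT = 1.00       # dollars per false negative
UNSURE_WEIGHT = 0.20   # dollars per unsure message


def real_cost(fp, fn, unsure):
    """Hypothetical helper: weighted cost of one test run, in dollars."""
    return fp * FP_WEIGHT + fn * FN_WEIGHT + unsure * UNSURE_WEIGHT


# (fp total, fn total, unsure t) for each column of the table above.
runs = {
    "normal": (1, 25, 152),
    "fragment": (1, 28, 172),
    "compound": (1, 25, 154),
}

for name, counts in runs.items():
    print("%-9s $%.2f" % (name, real_cost(*counts)))
```

Under these weights the whole difference between the columns comes from the fn and unsure counts, since fp is 1 in all three runs.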
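With the 'compound:' prefix on the generated tokens, the fragmentation
code is neutral for me again: the compound column is essentially the
same as normal (25 fn either way, 154 vs. 152 unsure), whereas the
unprefixed fragments were slightly worse (28 fn, 172 unsure).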
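To make the change under test concrete, here is a rough sketch of what prefixing the generated tokens looks like. The real fragmentation code is not shown in this thread, so the function name and the splitting rule (break on anything that is not a letter or digit) are assumptions for illustration only; the one part taken from the thread is the switch from "yield w" to "yield 'compound:' + w".

```python
import re

# Assumed splitting rule, for illustration only: break on runs of
# characters that are not ASCII letters or digits.
_SPLIT = re.compile(r"[^A-Za-z0-9]+")


def fragment_tokens(word, prefix="compound:"):
    """Hypothetical helper: yield the pieces of a composite word,
    each tagged with `prefix`. Yields nothing for a simple word."""
    pieces = [p for p in _SPLIT.split(word) if p]
    if len(pieces) < 2:
        return  # not a composite word; no extra tokens to generate
    for w in pieces:
        yield prefix + w  # previously: yield w


print(list(fragment_tokens("free-trial_offer")))
# ['compound:free', 'compound:trial', 'compound:offer']
print(list(fragment_tokens("hello")))
# []
```

Because the prefixed tokens are distinct strings, they are counted separately from the plain word, so a fragment such as "free" does not add to the statistics already kept for the standalone token "free".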
- Alex