[spambayes-dev] Very small change for composite word tokenizing.
Kenny Pitt
kennypitt at hotmail.com
Wed Aug 6 15:40:58 EDT 2003
Kenny Pitt wrote:
[snip]
>
>> -> # Break up composite words looking for good stuff
>> -> for w in longword_re.findall(word):
>> ->     if 3 <= len(w) <= maxword:
>> ->         yield word
>> ->
>
>
> Seems like most people are seeing this change as a loss or at best no
> gain. I wonder whether it would make a difference to accuracy if we
> returned special compound-word tokens instead of returning the
> components as normal words. Something like:
>
> yield 'compound:' + word
>
> I'm just speculating here because, unfortunately, I don't have enough
> messages saved up to test this myself. Anyone want to give this
> variation a try?
>
Uh oh, I just noticed a bug in the original that I didn't catch before
hitting Send. The original code above should be:

    yield w

instead of:

    yield word

The variation would then be:

    yield 'compound:' + w
Did everyone who previously tested this change catch the error? Without
this fix you would be inserting the *entire* compound token into your
training data once for each component word found (e.g. Very_Naughty_Bits
would result in 'Very_Naughty_Bits' with a count of 3 instead of 'Very',
'Naughty', and 'Bits' each with a count of 1). This could definitely
have a negative impact on the results.
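
For anyone who wants to retest, here's a minimal standalone sketch of
the corrected loop. Note that longword_re and maxword are stand-ins
here; the real definitions live in spambayes/tokenizer.py and may
differ.

    import re

    # Stand-ins only -- the actual regex and length limit used by
    # the SpamBayes tokenizer may differ.
    longword_re = re.compile(r"[A-Za-z0-9]+")
    maxword = 12

    def tokenize_compound(word):
        # Break up composite words looking for good stuff.
        for w in longword_re.findall(word):
            if 3 <= len(w) <= maxword:
                yield w                    # the fix: yield the component
                # yield 'compound:' + w    # ...or the tagged variation

    print(list(tokenize_compound('Very_Naughty_Bits')))
    # prints ['Very', 'Naughty', 'Bits'],
    # not 'Very_Naughty_Bits' three times

The idea behind the 'compound:' prefix would be to keep component
tokens distinct from the same words appearing on their own, so the
training counts for the two cases don't mix.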
--
Kenny Pitt