[spambayes-dev] Very small change for composite word tokenizing.
Kenny Pitt
kennypitt at hotmail.com
Wed Aug 6 15:40:58 EDT 2003
Kenny Pitt wrote:
[snip]
>
>> -> # Break up composite words looking for good stuff
>> -> for w in longword_re.findall(word):
>> ->     if 3 <= len(w) <= maxword:
>> ->         yield word
>> ->
>
>
> Seems like most people are seeing this change as a loss or at best no
> gain. I wonder whether it would make a difference to accuracy if we
> returned special compound-word tokens instead of returning the
> components as normal words. Something like:
>
> yield 'compound:' + word
>
> I'm just speculating here because, unfortunately, I don't have enough
> messages saved up to test this myself. Anyone want to give this
> variation a try?
>
Uh oh, I just noticed a bug in the original that I didn't catch before
hitting Send. The original code above should be:

    yield w

instead of:

    yield word

The variation would then be:

    yield 'compound:' + w
Did everyone who previously tested this change catch the error? Without
this fix you would be inserting the *entire* compound token into your
training data once for each component word found (e.g. Very_Naughty_Bits
would result in 'Very_Naughty_Bits' with a count of 3 instead of 'Very',
'Naughty', and 'Bits' each with a count of 1). This could definitely
have a negative impact on the results.
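
For anyone who wants to retest, here's a minimal standalone sketch of
the corrected loop. Note that longword_re and maxword are stand-ins
here; the real definitions live in spambayes/tokenizer.py and may
differ.

    import re

    # Stand-ins only -- the actual regex and length limit used by
    # the SpamBayes tokenizer may differ.
    longword_re = re.compile(r"[A-Za-z0-9]+")
    maxword = 12

    def tokenize_compound(word):
        # Break up composite words looking for good stuff.
        for w in longword_re.findall(word):
            if 3 <= len(w) <= maxword:
                yield w                    # the fix: yield the component
                # yield 'compound:' + w    # ...or the tagged variation

    print(list(tokenize_compound('Very_Naughty_Bits')))
    # prints ['Very', 'Naughty', 'Bits'],
    # not 'Very_Naughty_Bits' three times

The idea behind the 'compound:' prefix would be to keep component
tokens distinct from the same words appearing on their own, so the
training counts for the two cases don't mix.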
--
Kenny Pitt