[spambayes-dev] Very small change for composite word tokenizing.
Kenny Pitt
kennypitt at hotmail.com
Wed Aug 6 15:30:28 EDT 2003
Sean True wrote:
> This is the code that does it, in context, if not in patch form. I had
> mailed it to Tony, but not the whole list.
> Sorry about that.
>
> -- Sean
>
> Not exactly a patch, but it's a one-minute cut-and-paste. I'm theorizing
> that the memory hit is not horrendous -- it mostly generates sensible fragments:
> www.microsoft.com -> www, microsoft, com
> Very_naughty_bits -> very, naughty, bits
>
[snip]
>
> -> # Break up composite words looking for good stuff
> -> for w in longword_re.findall(word):
> ->     if 3 <= len(w) <= maxword:
> ->         yield w
> ->
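For anyone who wants to see the effect in isolation, here is a self-contained sketch of what that fragment does. The `longword_re` pattern and `maxword` value below are assumptions for illustration, not necessarily the actual SpamBayes tokenizer settings; note also that the loop must yield the fragment `w`, not the enclosing `word`:

```python
import re

# Assumed fragment pattern: runs of letters/digits, so punctuation and
# underscores act as separators. The real longword_re may differ.
longword_re = re.compile(r"[A-Za-z0-9]+")

maxword = 12  # assumed upper bound on fragment length

def tokenize_composite(word):
    """Break a composite word into sensible fragments."""
    for w in longword_re.findall(word):
        if 3 <= len(w) <= maxword:
            yield w  # yield the fragment, not the whole word

# list(tokenize_composite("www.microsoft.com")) -> ['www', 'microsoft', 'com']
```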
Seems like most people are seeing this change as a loss or at best no
gain. I wonder if it would make a difference in accuracy if we
returned special compound-word tokens instead of returning the
components as normal words? Something like:
yield 'compound:' + word
I'm just speculating here because I, unfortunately, don't have a
sufficient number of messages saved up to test this myself. Anyone want
to give this variation a try?
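In case it helps anyone set up that test, a minimal sketch of the variation (again, the regex and `maxword` here are placeholders rather than the real tokenizer values):

```python
import re

longword_re = re.compile(r"[A-Za-z0-9]+")  # assumed fragment pattern
maxword = 12  # assumed length cap

def tokenize_compound(word):
    # Prefix each fragment so the classifier can learn weights for
    # compound-derived tokens separately from ordinary words.
    for w in longword_re.findall(word):
        if 3 <= len(w) <= maxword:
            yield 'compound:' + w

# list(tokenize_compound("www.microsoft.com"))
# -> ['compound:www', 'compound:microsoft', 'compound:com']
```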
--
Kenny Pitt