[spambayes-dev] Very small change for composite word tokenizing.
Kenny Pitt
kennypitt at hotmail.com
Wed Aug 6 15:30:28 EDT 2003
Sean True wrote:
> This is the code that does it, in context, if not in patch form. I had
> mailed it to Tony, but not the whole list.
> Sorry about that.
>
> -- Sean
>
> Not exactly a patch, but it's a one-minute cut-and-paste. I'm theorizing
> that the memory hit is not horrendous -- it mostly generates sensible fragments:
> www.microsoft.com -> www, microsoft, com
> Very_naughty_bits -> very, naughty, bits
>
[snip]
>
> -> # Break up composite words looking for good stuff
> -> for w in longword_re.findall(word):
> ->     if 3 <= len(w) <= maxword:
> ->         yield w
> ->
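For anyone who wants to see the effect in isolation, here is a self-contained sketch of what that fragment does. The `longword_re` pattern and `maxword` value below are assumptions for illustration, not necessarily the actual SpamBayes tokenizer settings; note also that the loop must yield the fragment `w`, not the enclosing `word`:

```python
import re

# Assumed fragment pattern: runs of letters/digits, so punctuation and
# underscores act as separators. The real longword_re may differ.
longword_re = re.compile(r"[A-Za-z0-9]+")

maxword = 12  # assumed upper bound on fragment length

def tokenize_composite(word):
    """Break a composite word into sensible fragments."""
    for w in longword_re.findall(word):
        if 3 <= len(w) <= maxword:
            yield w  # yield the fragment, not the whole word

# list(tokenize_composite("www.microsoft.com")) -> ['www', 'microsoft', 'com']
```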
Seems like most people are seeing this change as a loss or at best no
gain. I wonder if it would make a difference in accuracy if we
returned special compound-word tokens instead of returning the
components as normal words? Something like:
yield 'compound:' + word
I'm just speculating here because I, unfortunately, don't have a
sufficient number of messages saved up to test this myself. Anyone want
to give this variation a try?
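In case it helps anyone set up that test, a minimal sketch of the variation (again, the regex and `maxword` here are placeholders rather than the real tokenizer values):

```python
import re

longword_re = re.compile(r"[A-Za-z0-9]+")  # assumed fragment pattern
maxword = 12  # assumed length cap

def tokenize_compound(word):
    # Prefix each fragment so the classifier can learn weights for
    # compound-derived tokens separately from ordinary words.
    for w in longword_re.findall(word):
        if 3 <= len(w) <= maxword:
            yield 'compound:' + w

# list(tokenize_compound("www.microsoft.com"))
# -> ['compound:www', 'compound:microsoft', 'compound:com']
```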
--
Kenny Pitt