[Spambayes] Stemming and stopword elimination
skip at pobox.com
Fri Jan 17 15:19:48 EST 2003
>> has someone already experimented with Information Retrieval
>> techniques like stopword elimination (stopwords: the, a, an, or, and,
>> ...) and word stemming?
Tim> Yes and no.
Tim> Stemming is a different issue. We not only don't stem, we don't
Tim> even strip punctuation.
Well, mostly. In the usual linguistic sense spambayes doesn't stem, but
the tokenizer does collapse some things. Long strings are compressed to
something like "skip b 40", where 'b' is the first letter and '40' is the
length of the string (or the number of characters elided). In the email
prefix stuff I checked in and the suffix stuff I am still pondering, I
generate tokens like "pfxlen:%d" up to some small threshold value. Above
that, I just generate "pfxlen:big" or "sfxlen:big". Otherwise, I'd have a
number of tokens in my database with keys of "pfxlen:N" (where N is a
"biggish" number) and a value of (1,0) (spammy hapaxes - seen once in spam
and never in ham).
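
A rough sketch of the two collapsing rules described above, not the actual
tokenizer code. The function names, the 12-character cutoff, the rounding of
the length down to a multiple of 10, and the threshold of 5 are all
assumptions made for illustration; only the token shapes ("skip b 40",
"pfxlen:%d", "pfxlen:big", "sfxlen:big") come from the message.

```python
MAX_WORD_LEN = 12    # assumed cutoff for a "long" string
BIG_THRESHOLD = 5    # assumed "small threshold value"


def collapse_long_word(word, max_len=MAX_WORD_LEN):
    """Return short words unchanged; summarize long ones as a skip token."""
    if len(word) <= max_len:
        return word
    # Assumption: round the length down to a multiple of 10 so that
    # strings of similar length share one token instead of each
    # producing its own rare token.
    return "skip %s %d" % (word[0], len(word) // 10 * 10)


def length_token(kind, n, threshold=BIG_THRESHOLD):
    """Build a 'pfxlen:N' / 'sfxlen:N' style token, bucketing large N."""
    if n > threshold:
        # All large values share a single token, so they cannot each
        # become a separate hapax in the database.
        return "%s:big" % kind
    return "%s:%d" % (kind, n)


if __name__ == "__main__":
    print(collapse_long_word("hello"))
    print(collapse_long_word("b" * 43))
    print(length_token("pfxlen", 3))
    print(length_token("pfxlen", 17))
```

The point of both functions is the same: map an open-ended range of values
onto a small set of tokens so the classifier sees each token often enough to
learn something from it.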