[Spambayes] Stemming and stopword elimination

Skip Montanaro skip at pobox.com
Fri Jan 17 15:19:48 EST 2003


    >> has someone already experimented with Information Retrieval
    >> techniques like stopword elimination (stopwords: the, a, an, or, and,
    >> ...) and word stemming?

    Tim> Yes and no.

    Tim> Stemming is a different issue.  We not only don't stem, we don't
    Tim> even strip punctuation.  

Well, mostly.  In the usual linguistic sense spambayes doesn't stem, but
the tokenizer does collapse some things.  Long strings are compressed to
something like "skip b 40", where 'b' is the first letter and '40' is the
length of the string (or the number of characters elided).  In the email
prefix stuff I checked in and the suffix stuff I am still pondering, I
generate tokens like "pfxlen:%d" up to some small threshold value.  Above
that, I just generate "pfxlen:big" or "sfxlen:big".  Otherwise, I'd have a
number of tokens in my database with keys of "pfxlen:N" (where N is a
"biggish" number) and a value of (1,0) (spammy hapaxes: seen once in spam
and never in ham).
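
A rough sketch of the two kinds of collapsing described above, for
illustration only.  This is not the actual spambayes tokenizer code: the
function names and both threshold values are made up, and the token
formats simply follow the examples quoted in this message.

```python
# Hypothetical sketch; names and thresholds are assumptions,
# not taken from the spambayes source.
MAX_WORD_LEN = 12   # assumed cutoff above which a string counts as "long"
MAX_EXACT_LEN = 5   # assumed "small threshold" for exact-length tokens


def collapse_long_word(word):
    """Return the word itself, or a summary token if it is too long."""
    if len(word) <= MAX_WORD_LEN:
        return word
    # Keep only the first character and the overall length.
    return "skip %s %d" % (word[0], len(word))


def length_token(kind, n):
    """Return e.g. 'pfxlen:3' for small n, and 'pfxlen:big' otherwise."""
    if n <= MAX_EXACT_LEN:
        return "%s:%d" % (kind, n)
    # Bucket every larger length together so that rare lengths do not
    # each become their own one-off token in the database.
    return "%s:big" % kind
```

With these assumed thresholds, a 40-character string starting with 'b'
becomes "skip b 40", and any prefix length above 5 maps to the single
token "pfxlen:big" instead of one token per distinct length.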

Skip


