[Spambayes] Stemming and stopword elimination

Fri Jan 17 15:51:54 EST 2003

[Alexander Leidinger]
> has someone already experimented with Information Retrieval techniques
> like stopword elimination (stopwords: the, a, an, or, and, ...) and word
> stemming?

Yes and no.

Stopword elimination doesn't make sense here.  A typical IR application
requires space proportional to the number of times a word appears, but this
app doesn't:  one word == one database entry, no matter how many times the
word appears.  Identifying stopwords would complicate and slow the code, and
introduce language dependence, for a trivial database savings.
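
To make the storage argument concrete, here is a minimal sketch of a
token database in which each distinct token owns exactly one entry.  The
names `train` and `db`, the (spam, ham) tuple layout, and the choice to
count a token once per message are assumptions made for illustration, not
the project's actual code.

```python
def train(db, message_tokens, is_spam):
    """Update per-token message counts (hypothetical layout).

    Each token maps to a (spam_count, ham_count) pair.  A token is
    counted once per message, however often it occurs in that message.
    """
    for tok in set(message_tokens):
        spam, ham = db.get(tok, (0, 0))
        db[tok] = (spam + 1, ham) if is_spam else (spam, ham + 1)


db = {}
train(db, "the the the offer is the best".split(), True)
train(db, "the meeting is at noon".split(), False)

# "the" occurs five times across both messages but owns one entry.
the_entry = db["the"]
entry_count = len(db)
```

In this layout, dropping a stopword such as "the" saves one entry, which
is why the savings are small next to an index that stores every occurrence.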

Some Classic Bayesian classifiers remove stopwords for another reason
(related to one discussed below), but that reason doesn't make sense in this
code either:  when scoring, the classifier automatically ignores words with
a spamprob close to 0.5, so stopwords that truly *are* common across all
kinds of texts have no effect on scoring.
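
A rough sketch of that filtering step follows.  The function name
`significant` and the cutoff of 0.1 are assumptions chosen for
illustration, and the spamprobs are made-up numbers, not measurements.

```python
def significant(spamprobs, min_strength=0.1):
    """Keep only tokens whose spamprob is far enough from 0.5.

    min_strength is a hypothetical cutoff: tokens with
    abs(prob - 0.5) below it are treated as carrying no evidence.
    """
    return {tok: p for tok, p in spamprobs.items()
            if abs(p - 0.5) >= min_strength}


# Made-up spamprobs: two stopword-like tokens and two strong clues.
probs = {"the": 0.51, "and": 0.48, "viagra": 0.99, "python": 0.02}
kept = significant(probs)
```

Under these assumptions "the" and "and" drop out of scoring without any
stopword list, while a word that is common but genuinely skewed toward
ham or spam would still be kept.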

Stemming is a different issue.  We not only don't stem, we don't even strip
punctuation.  So, e.g., "free" and "free," and "free:" and "(free" and
"free--" and "free?" and "free!" and "free!!!" (etc) are all considered
distinct by our tokenizer.  That definitely grows the database size, but
tests run both early and late in the project showed that leaving punctuation
in works better than taking it out.
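
As a simplified illustration (the real tokenizer does more than split on
whitespace), the following shows how those variants stay distinct when
punctuation is kept and collapse when it is stripped:

```python
import string

text = "free free, free: (free free-- free? free! free!!!"

# Split on whitespace only: punctuation stays attached to each token.
with_punct = set(text.split())

# For comparison, strip leading and trailing punctuation from each token.
stripped = {tok.strip(string.punctuation) for tok in text.split()}
```

Here `with_punct` holds eight distinct tokens, each of which can earn its
own spamprob, while `stripped` holds only "free".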

In the literature on Classic Bayesian classifiers, better results are
reported when using stemming.  But they do something else very different
too:  a "mutual information" calculation (or moral equivalent) is done on
all the training data, to identify the N words with (in effect) the greatest
discriminatory power.  N is typically less than 1000, and all words not in
that set are completely ignored.  In that context, it's very easy to believe
that stemming is valuable, else minor word variations would compete with
entirely different words for the privilege of not being ignored.  OTOH, we
ignore nothing except for tokens with spamprobs close to 0.5.
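
For readers unfamiliar with that kind of feature selection, here is a
minimal sketch of one common formulation:  mutual information between a
word's presence in a message and the message's class, followed by keeping
the top N words.  The function names and all counts are invented for
illustration, and the papers in question may use a different variant.

```python
import math


def mutual_information(spam_with, ham_with, n_spam, n_ham):
    """Mutual information, in bits, between word presence and class.

    spam_with / ham_with: number of spam / ham messages containing the word.
    n_spam / n_ham: total number of spam / ham messages.
    """
    total = n_spam + n_ham
    present = spam_with + ham_with
    absent = total - present
    cells = [
        (spam_with, present, n_spam),          # word present, spam
        (ham_with, present, n_ham),            # word present, ham
        (n_spam - spam_with, absent, n_spam),  # word absent, spam
        (n_ham - ham_with, absent, n_ham),     # word absent, ham
    ]
    mi = 0.0
    for joint, word_margin, class_margin in cells:
        if joint:  # an empty cell contributes nothing
            mi += (joint / total) * math.log2(
                joint * total / (word_margin * class_margin))
    return mi


def top_n(counts, n_spam, n_ham, n):
    """Return the n words with the highest mutual information."""
    scores = {w: mutual_information(s, h, n_spam, n_ham)
              for w, (s, h) in counts.items()}
    return sorted(scores, key=scores.get, reverse=True)[:n]


# Invented counts out of 100 spam and 100 ham messages.
counts = {"offer": (30, 2), "offers": (25, 3), "offered": (20, 2),
          "meeting": (1, 40), "the": (95, 96)}
selected = top_n(counts, 100, 100, 2)
```

With only a few slots available, "offer", "offers" and "offered" each
compete separately against "meeting"; stemming would merge them into one
candidate, which is the effect described above.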

> ...
> I don't think this will change the failure rate significantly (maybe
> better results with little training data, maybe worse; I don't expect
> much change with large training data), but it should reduce the size of
> the needed database.

I expect that stopword elimination would make no difference, unless the
stopword list contained words that are actually hammish or spammish in real
life (in which case stopword elimination would hurt); the database size
difference would be too small to notice.  I expect that stemming would hurt,
period, although it would reduce database size.