[Spambayes] Tokenising clues

Tim Peters tim.one@comcast.net
Tue, 01 Oct 2002 15:54:32 -0400


[Neil Schemenauer]
> ...
> I don't like the way the tokenizer is heading right now either.

I only care which way the results are heading <wink>.

> I want to try generating n-grams from the headers.  If that can be
> made to work reasonably well, I think it will be a much better
> approach long term.

Be sure to read the comments in tokenizer.py about previous experiments with
character n-grams.  A string of length N produces N-n+1 character n-grams,
and that's a ton of clues for a single string.  For example,

Organization: Massachusetts Institute of Technology

is going to generate a big pile of ham clues, and if a spammer happens to
include that header too, it's going to be hard to overcome them.  There are
some specific examples in the aforementioned comments.  This should be less
severe now, though, since max_discriminators is about 10x larger than it
used to be.  Certainly worth trying!
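To make the clue explosion concrete, here is a minimal sketch of overlapping character n-gram generation (not the actual experimental code from tokenizer.py; the function name and header string are just illustrative):

```python
def char_ngrams(s, n=3):
    """Yield overlapping character n-grams of a string.

    A string of length N produces N - n + 1 grams, so even a
    single header line turns into dozens of clues.
    """
    return [s[i:i + n] for i in range(len(s) - n + 1)]

header = "Massachusetts Institute of Technology"
grams = char_ngrams(header, 3)
print(len(header), len(grams))  # 37 characters -> 35 trigrams
```

Each of those 35 trigrams is a separate clue competing for the scorer's attention, which is why one familiar header string can swamp the discriminator list.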