[Spambayes] full o' spaces

Tim Peters tim.one at comcast.net
Fri Mar 7 14:22:51 EST 2003


[Neil Schemenauer]
> ...
> That said, it seems logical that it would be better if short words were
> not completely discarded by the tokenizer.  Perhaps it would be enough
> to remember the ratio of dropped words to generated tokens.  Something
> like:
>
>     'shortratio:2**%d' % log2(nshort / ntokens)
>
> As you can tell, I love logarithms (as any true engineer should). :-)
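
For concreteness, here's roughly what generating that token might look
like.  This is a sketch only:  nshort and ntokens are assumed counters
the tokenizer would keep, and floor/log2 come from the math module.

    from math import floor, log2

    def shortratio_token(nshort, ntokens):
        """Bucket the dropped-short-word ratio into a power-of-two token."""
        if nshort == 0 or ntokens == 0:
            return 'shortratio:none'
        return 'shortratio:2**%d' % floor(log2(nshort / ntokens))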

I've mentioned before that the metatoken

    (number of bytes)/(number of words)

was a very strong indicator in early tests.  An unusually high ratio of
bytes to words pointed strongly to spam, while spam using the interspersed-
whitespace gimmick would show an unusually low ratio.  I didn't check the
code in, though, because it made no difference in error rates at the time.
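
Here's a sketch of how that kind of metatoken could be computed and
coarsely log-bucketed; the token name, the bucketing, and the helper
itself are illustrative assumptions, not the code from those early tests.

    import math

    def bytes_per_word_token(msg_text):
        """Emit a coarse token for a message's bytes-to-words ratio."""
        words = msg_text.split()
        if not words:
            return 'bytes-per-word:none'
        ratio = len(msg_text.encode('utf-8')) / len(words)
        # Log-bucket so nearby ratios collapse into the same token; the
        # "f u l l   o f   s p a c e s" gimmick lands in a low bucket.
        return 'bytes-per-word:2**%d' % math.floor(math.log2(ratio))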

But a single token doesn't carry much weight, and any gimmick that reduces
response rate (including those that make text harder to read) probably won't
last long.

> Alternatively, perhaps we could just drop the lower limit on token
> length.

Experiments were run on that, and they hurt.  See "How big should 'a word'
be?" in tokenizer.py.

Note that we have a configurable limit for the upper end of how big a word
can be.  The evidence in favor of adding it was (at best) weak.
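
To make the two limits concrete, here is a sketch of the gating; the
default lengths and the 'skip' summary token are illustrative
assumptions, not the actual option values or token format.

    def gate_words(words, min_len=3, max_len=12):
        """Drop too-short words, summarize too-long ones, pass the rest."""
        for w in words:
            if len(w) < min_len:
                continue                  # too short:  discarded outright
            elif len(w) > max_len:
                # Keep only a coarse summary of an oversized "word".
                yield 'skip:%c %d' % (w[0], len(w) // 10 * 10)
            else:
                yield w

For example, gate_words("a free spamologically long advertisement".split())
drops "a", keeps "free" and "long", and turns the two oversized words into
"skip:s 10" and "skip:a 10".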



