[Spambayes] full o' spaces
tim.one at comcast.net
Fri Mar 7 14:22:51 EST 2003
> That said, it seems logical that it would be better if short words were
> not completely discarded by the tokenizer. Perhaps it would be enough
> to remember the ratio of dropped words to generated tokens. Something
> like 'shortratio:2**%d' % log2(nshort / ntokens)
> As you can tell, I love logarithms (as any true engineer should). :-)
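For concreteness, a rough sketch of how a metatoken along those lines could
be generated -- the names nshort, ntokens and the helper are illustrative,
not code from the tokenizer:

    from math import log2

    def short_ratio_token(nshort, ntokens):
        # Bucket the dropped-word/token ratio on a log2 scale, so the
        # classifier sees a handful of distinct metatokens rather than
        # raw counts.
        if nshort == 0 or ntokens == 0:
            return 'shortratio:none'
        return 'shortratio:2**%d' % round(log2(nshort / ntokens))

With, say, 40 short words dropped against 160 generated tokens, log2(0.25)
is -2 and the token comes out as 'shortratio:2**-2'.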
I've mentioned before that the metatoken
(number of bytes)/(number of words)
was a very strong indicator in early tests: an unusually high ratio of
bytes to words pointed strongly at spam, while spam using the interspersed-
whitespace gimmick would have an unusually low ratio. I didn't check in the
code, though, because it made no difference in error rates at the time.
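A sketch of how such a metatoken could be computed (illustrative only, not
the code that was tested back then):

    from math import log2

    def bytes_per_word_token(body):
        # Metatoken for (number of bytes) / (number of words).  Messages
        # full of interspersed spaces have many tiny "words", so the ratio
        # comes out low; long unbroken runs (URLs, base64 blobs) push it
        # high.
        words = body.split()
        if not words:
            return 'bytes/words:none'
        ratio = len(body) / len(words)   # characters ~= bytes for ASCII mail
        return 'bytes/words:2**%d' % round(log2(ratio))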
But a single token doesn't carry much weight, and any gimmick that reduces
response rate (including those that make text harder to read) probably won't
stay in use for long anyway.
> Alternatively, perhaps we could just drop the lower limit on token size.
Experiments were run on that, and they hurt. See "How big should 'a word'
be?" in tokenizer.py.
Note that we have a configurable limit for the upper end of how big a word
can be. The evidence in favor of adding it was (at best) weak.
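In outline, the length filtering under discussion looks something like this;
the limits and the skip-token format here are illustrative, and the real
values and option names live in tokenizer.py:

    def tokenize_words(text, min_len=3, max_len=12):
        # Drop words below min_len; map words above max_len to a coarse
        # "skip" token recording only the first character and rough length.
        for word in text.split():
            n = len(word)
            if n < min_len:
                continue            # too short: discarded entirely
            elif n > max_len:
                yield 'skip:%c %d' % (word[0], n // 10 * 10)
            else:
                yield word.lower()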