[Spambayes] full o' spaces

Neil Schemenauer nas at python.ca
Fri Mar 7 10:14:02 EST 2003

Skip Montanaro wrote:
> I just received a message (attached) in which every word in the body was
> space-separated.

I wouldn't worry about it too much.  It doesn't look like an effective
spam to me.  I gave up reading it after the first line.  I don't think
the bozos who respond to spam would make any more of an effort to read it.

> I'm working on a tokenizer patch.

Perhaps we should be careful about adding stuff unless we can show a
statistically significant improvement in error rates given real test data.

That said, it seems logical that it would be better if short words were
not completely discarded by the tokenizer.  Perhaps it would be enough
to remember the ratio of dropped words to generated tokens.  Something like:

    'shortratio:2**%d' % log2(nshort / ntokens) 

As you can tell, I love logarithms (as any true engineer should). :-)
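
To make the ratio idea concrete, here is a rough sketch of how such a
token might be generated.  It is not the Spambayes tokenizer: the helper
names, the whitespace splitting, and the min_len=3 cutoff are all
assumptions made for illustration, as is flooring the logarithm so that
ratios in the same power-of-two range share one token.

```python
import math

def count_short_and_tokens(text, min_len=3):
    """Hypothetical helper: split on whitespace and count dropped vs. kept words.

    min_len=3 is an assumed cutoff, not taken from the real tokenizer.
    """
    words = text.split()
    nshort = sum(1 for w in words if len(w) < min_len)
    ntokens = len(words) - nshort
    return nshort, ntokens

def short_ratio_token(nshort, ntokens):
    """Return a bucketed 'shortratio' token, or None when the ratio is undefined."""
    if nshort <= 0 or ntokens <= 0:
        # log2 is undefined for zero, and there is no ratio without tokens.
        return None
    # Floor the base-2 log so nearby ratios collapse into the same bucket.
    exponent = math.floor(math.log2(nshort / ntokens))
    return 'shortratio:2**%d' % exponent

if __name__ == '__main__':
    nshort, ntokens = count_short_and_tokens("a b c d hello world")
    print(short_ratio_token(nshort, ntokens))
```

A message made up entirely of short words generates no ordinary tokens,
so this sketch returns None for it; whether that case deserves its own
token is something the test data would have to settle.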

Alternatively, perhaps we could just drop the lower limit on token

