[Spambayes] full o' spaces

Neil Schemenauer nas at python.ca
Fri Mar 7 10:14:02 EST 2003


Skip Montanaro wrote:
> I just received a message (attached) in which every word in the body was
> space-separated.

I wouldn't worry about it too much.  It doesn't look like an effective
spam to me.  I gave up reading it after the first line.  I don't think
the bozos who respond to spam would make any more of an effort to read
it.

> I'm working on a tokenizer patch.

Perhaps we should be careful about adding stuff unless we can show a
statistically significant improvement in error rates given real test
data.

That said, it seems logical that it would be better if short words were
not completely discarded by the tokenizer.  Perhaps it would be enough
to remember the ratio of dropped words to generated tokens.  Something
like:

    'shortratio:2**%d' % log2(float(nshort) / ntokens)

As you can tell, I love logarithms (as any true engineer should). :-)
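
To make that concrete, here is a minimal sketch in Python. It is not a
patch against the real tokenizer: the function name, the 3-character
cutoff and the plain whitespace split are all assumptions for
illustration, and math.log2 needs Python 3.3 or later.

```python
import math

MIN_LEN = 3  # assumed lower limit on token length, for illustration only


def tokenize_with_shortratio(text, min_len=MIN_LEN):
    """Hypothetical tokenizer: word tokens plus one 'shortratio' meta-token."""
    words = text.split()
    # Keep only the words that meet the length limit.
    tokens = [w.lower() for w in words if len(w) >= min_len]
    nshort = len(words) - len(tokens)
    if nshort:
        # Use 1 as the denominator when every word was dropped,
        # so the ratio is still defined.
        ratio = nshort / max(len(tokens), 1)
        # Bucket the ratio by its nearest power of two so that similar
        # messages end up sharing the same meta-token.
        tokens.append('shortratio:2**%d' % round(math.log2(ratio)))
    return tokens
```

The meta-token is only emitted when at least one word was dropped, so
ordinary messages with no short words get no extra token. Whether it
actually helps would still have to be shown on real test data.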

Alternatively, perhaps we could just drop the lower limit on token
length.
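
A rough sketch of what that would mean for a space-separated message,
again with a made-up stand-in tokenizer rather than the real one (the
min_len parameter and its default of 3 are assumptions):

```python
def simple_tokenize(text, min_len=3):
    """Hypothetical stand-in tokenizer: split on whitespace and
    drop words shorter than min_len."""
    return [w.lower() for w in text.split() if len(w) >= min_len]


spaced = "F r e e  m o n e y"

# With the assumed limit of 3, every word is dropped.
with_limit = simple_tokenize(spaced)

# With the limit lowered to 1, each letter becomes its own token.
without_limit = simple_tokenize(spaced, min_len=1)
```

With the limit in place the classifier sees nothing from the body at
all; without it, it sees a stream of single-letter tokens, which may or
may not carry useful evidence.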

  Neil


