[Spambayes] full o' spaces
tim.one at comcast.net
Fri Mar 7 22:57:38 EST 2003
>> I disagree. We should not abandon the rigorous, testing based
>> strategy that got SB to its current state. Adding more code every
>> time a spammer comes up with a new trick is completely reactionary
>> and will eventually destroy the code base.
> Hear, hear. Don't turn SpamBayes into a convoluted, hocus-pocus
> collection of ad-hoc rules a la SpamAssassin.
Indeed, I'd rather keep it a convoluted, hocus-pocus collection of
tokenization gimmicks <0.9 wink>. Really, I doubt SpamAssassin has anything
more bizarre than our "skip:" tokens, and I kept the latter because taking
them out hurt results. I've never been sure why -- and I was never able to
find a way of summarizing thrown-out "too-long tokens" that did as well,
either. There's magic enough to go around. Also ego deflaters! I'm still
convinced that preserving case should help, and also looking at (at least)
bigrams -- unfortunately, the data didn't agree. It may in the future,
though, if spam gets more sophisticated.
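For readers who haven't met these gimmicks, here is a minimal sketch of the ideas mentioned above: a "skip:" summary token for over-long words, optional case folding, and optional bigrams. It is an illustration only, not the project's actual tokenizer. The function name, the 12-character cutoff, the 3-character minimum, and the "bi:" prefix are all assumptions made for the example.

```python
def tokenize(text, max_word_len=12, fold_case=True, bigrams=False):
    """Yield tokens from text (hypothetical helper, for illustration only)."""
    if fold_case:
        # Folding case discards the distinction between "FREE" and "free".
        text = text.lower()
    prev = None
    for word in text.split():
        n = len(word)
        if n < 3:
            # Assumed minimum length: very short words are dropped.
            continue
        if n <= max_word_len:
            token = word
        else:
            # Summarize a too-long word by its first character and its
            # length rounded down to a multiple of 10.
            token = "skip:%s %d" % (word[0], n // 10 * 10)
        yield token
        if bigrams and prev is not None:
            # Pair each token with the previously emitted token.
            yield "bi:%s %s" % (prev, token)
        prev = token
```

Calling list(tokenize("Buy " + "x" * 25 + " now")) gives ['buy', 'skip:x 20', 'now']. Whether any of these switches helps is an empirical question that only testing against real ham and spam can answer.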
> Keep testing; if a technique doesn't measurably improve the result, toss
> it.
At the time I got yanked from this project, I was looking to remove code
rather than add more. There are too many tokenization options already, and
it isn't clear that some of them do anyone any good anymore. The
gary_combining classifier scheme should also go away.