[Spambayes] Spam Bayes use in a corporation
tim.one at comcast.net
Wed May 19 22:53:16 EDT 2004
[Michael C. Neel]
> As said, SpamBayes is under an OpenSource license. There is no money
> involved, but there is one simple rule it all boils down to: You can
> take it, use it, sell it, give it away, and even change it - but when
> you change it you have to make those changes available to anyone who
> wants them. Posting them to the list or sending them in as a patch
> would be enough.
The PSF license SpamBayes is released under does not require that. People
are free to build proprietary ("closed source") software incorporating any
or all of the SpamBayes code, and keep their changes secret, if that's what
they want (they won't really be doing themselves or their users a favor by
keeping their code secret, but it takes experience rather than arguments to
understand why that's so). The PSF license does require that derivative
works include "a brief summary of the changes made" to the SpamBayes code
they incorporate, but "brief summary" means what it says. For example,
"replaced classifier.py's probability calculations with a secret algorithm"
is good enough (if that's what they did).
Requiring derivative works to be released under a particular kind of license
is a feature of *some* open source licenses, most notably the GPL. But it's
not part of the definition of Open Source as promulgated by the Open Source
Initiative, and the PSF and GPL licenses are both certified as Open Source
by the OSI:
> Read up on the downsides of whitelisting and blacklisting so you are
> ready to show why a bayes filter is better. Then the reason SpamBayes
> works better than the others is most Bayes filters use two "buckets" -
> good and bad; spambayes uses 3 - good, bad, and unknown. This helps
> because Dr Bayes's filter never expected the data to try and fool him,
> i.e. spam acting like ham.
Or vice versa. Some messages are plain ambiguous, and require human
judgment to classify correctly. I'm still delighted at how well SpamBayes
usually manages to isolate those.
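To make the three-bucket idea concrete, here is a minimal sketch of how a
spam probability can be mapped to ham, unsure, or spam with two cutoffs. The
function name `bucket` and the cutoff values 0.20 and 0.90 are assumptions
chosen for illustration; this is not a quotation of the SpamBayes code.

```python
def bucket(score, ham_cutoff=0.20, spam_cutoff=0.90):
    """Map a spam probability in [0.0, 1.0] to one of three buckets.

    The cutoff defaults are illustrative assumptions, not values
    taken from this thread.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0.0 and 1.0")
    if score < ham_cutoff:
        return "ham"       # confidently good mail
    if score >= spam_cutoff:
        return "spam"      # confidently bad mail
    return "unsure"        # ambiguous: left for human judgment
```

Anything scoring between the two cutoffs lands in the middle bucket, which is
where the plain ambiguous messages are meant to end up.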
> And the last thing I'll add is my own personal results. I have two accounts
> filtered with spambayes, one gets about 20 hams a day and 400 spams,
> the other gets maybe 5 hams a week and 30 spams a day. With spambayes
> trained on a set of ~100 spam and ~100 ham for each account, I see
> maybe 5 suspects a day for both accounts combined. After a few weeks
> there are no more mislabeled spams as hams and vice versa (these are
> rare to start with). No other spam tool I know of is this good.
I see you took the advice to keep training data balanced to heart. Good for
you! It really does work best that way, and we still don't have a good
approach to living with badly unbalanced training data. Then again, nobody
pays me to think about that either <wink>.
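One plain way to keep training data balanced is to downsample the larger
collection before training. The sketch below assumes that approach; the
helper name `balance` and its `seed` parameter are hypothetical, and nothing
in this thread prescribes this particular method.

```python
import random

def balance(ham, spam, seed=0):
    """Return equal-sized random samples of ham and spam.

    Hypothetical helper: downsamples the larger collection to the
    size of the smaller one, so both classes train on the same count.
    """
    n = min(len(ham), len(spam))
    rng = random.Random(seed)  # fixed seed keeps the selection repeatable
    return rng.sample(list(ham), n), rng.sample(list(spam), n)
```

With 20 hams and 400 spams this keeps all 20 hams and a random 20 of the
spams. The cost is that most of the spam goes unused for training.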