[Mailman-Developers] Re: Anti-spam "killer app"?

David Champion dgc@uchicago.edu
Thu, 22 Aug 2002 09:44:32 -0500

* On 2002.08.22, in <1030025638.2258.44.camel@fornax.bibsys.no>,
*	"Daniel Buchmann" <Daniel.Buchmann@bibsys.no> wrote:
> Let's say 99.99% of all spam is in english (which is my experience), and
> my mother tongue is norwegian. ;)

Funny: my experience is that 60% of all spam is in Chinese, Spanish, or
Turkish. :)

> Let's also say that I usually never receive mails written in english.
> The Bayesian approach would then put all english words in a bad-words
> list (except words found in headers), and all norwegian words in a
> good-words list, wouldn't it?

Not quite. It would put the limited subset of English words which appear
in spam into a bad-words list. This isn't nit-picking: as soon as you
save one legitimate English message to your good list (copy one from
Usenet if you need to), the stats are weighted. Add another legit one,
and it's even smarter.

If you *never* receive legitimate English mail, this is not a problem for
you: all English truly is spam. But if you *sometimes* receive English
mail, add them all to your good list for a short while, and you'll find
that the filter figures it out.
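
As a rough illustration of the weighting described above, here is a
minimal Python sketch. It is not the home-brewed system mentioned in
this message, and it simplifies the statistics considerably. The class
name TokenStats, the whitespace tokenizer, the 0.01/0.99 clamping, and
the neutral 0.5 for unseen tokens are all assumptions made for the
example.

```python
from collections import Counter


class TokenStats:
    """Per-token counts from messages filed as good or bad (hypothetical helper)."""

    def __init__(self):
        self.good = Counter()   # token -> number of good messages containing it
        self.bad = Counter()    # token -> number of spam messages containing it
        self.n_good = 0
        self.n_bad = 0

    def train(self, text, is_spam):
        # Naive whitespace tokenizer (assumption); each token counts once per message.
        tokens = set(text.lower().split())
        if is_spam:
            self.bad.update(tokens)
            self.n_bad += 1
        else:
            self.good.update(tokens)
            self.n_good += 1

    def spam_probability(self, token):
        token = token.lower()
        g = self.good[token] / max(self.n_good, 1)
        b = self.bad[token] / max(self.n_bad, 1)
        if g + b == 0:
            return 0.5  # never seen: treated as neutral (assumption)
        # Clamp so that no single token is ever treated as absolute proof.
        return min(0.99, max(0.01, b / (g + b)))


stats = TokenStats()
stats.train("free money now", is_spam=True)
stats.train("hei hvordan har du det", is_spam=False)
before = stats.spam_probability("free")

# File one legitimate English message to the good list.
stats.train("the free software list is moving", is_spam=False)
after = stats.spam_probability("free")
```

With this toy data, "free" starts at the clamped maximum of 0.99 and
drops to about 0.67 after a single legitimate English message is filed,
while "money", which only ever appeared in spam, stays at 0.99.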

> 1. What happens the day I join an english mailing list, or receive a
> mail written in english?

It *might* be marked as spam; it depends significantly on the headers,
and on any words shared between Norwegian and English.

> 2. What happens if I receive a mail written in norwegian but containing
> a few english words, i.e. quoting someone?

Nothing special. The volume of good Norwegian will make any English in
your message matter little. Besides, it's not even certain that these
few English words will be in your bad-words list at all.
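
To see why a handful of English words matters little, consider one
common way of combining per-token probabilities into a message score.
This is only a sketch under assumptions: the function name
combined_score is made up for the example, the per-token values are
invented, and a real filter may select or weight tokens differently.

```python
from math import prod


def combined_score(probs):
    """Combine per-token spam probabilities into one score (naive Bayes style)."""
    p = prod(probs)                     # evidence that the message is spam
    q = prod(1.0 - x for x in probs)    # evidence that the message is not spam
    return p / (p + q)


# Ten Norwegian tokens seen only in good mail, plus three quoted English
# tokens that happen to be on the bad-words list (all values invented).
mostly_norwegian = [0.01] * 10 + [0.99] * 3

# The same message if the English tokens are unknown to the filter and
# therefore treated as neutral.
english_unknown = [0.01] * 10 + [0.5] * 3

score_listed = combined_score(mostly_norwegian)
score_unknown = combined_score(english_unknown)
```

Both scores come out far below any sensible spam threshold: the many
strongly good tokens swamp the few bad ones.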

> I'd say it would discard mail #1, but let through #2...
> What do you think..?

The moral is: don't activate a Bayesian filter and start filing all its
discoveries to the bitbucket. Watch it for a while, and when you're
comfortable, start saving its finds to a circular file. Check this file
occasionally, and file any false positives to your good-words list.
Also, meanwhile, file any missed spam to your bad-words list.

Why I say these things:

I've been using my home-brewed system based on this article for a
few days now, and it's pretty sharp. It's missed a few, but it's
learning quickly. It initially had some false positives: since I receive
postmaster and abuse mail at my domain, forwarded spam got flagged. But
it's since learned to ignore those, too, while flagging the actual spam
messages contained in those messages when they arrive separately. And
so far, it has only 94 spams and 99 non-spams in the database, which together
provide 21,000 text tokens I have data on. Graham's article discusses
having 4000 of each, IIRC. I expect even better results as I approach
that, but I'm letting it happen naturally at this point.

(I'm certain that I could get the same results with fewer messages; I
started out by filing in a big pile of known messages, and I've been
fine-tuning since. There's lots of overlap in the statistical value
provided by these messages. At some point I'll clean out my database
completely and begin again, tabula rasa, to see how few messages it
takes to give me satisfactory results. But I probably won't try this
until I'm more or less done gnashing the software.)

In a mailing list server, you'd want to set up for identified spam to
be redirected for moderator approval, rather than flinging it away.
Once your databases are fleshed out pretty well, you might be able to
start rejecting messages above some very high rating (say, 98%) and
submitting for approval those above something lower (say, 90%).
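
The two-threshold policy could be sketched as follows. This is not
Mailman code: the function name route, the returned labels, and the
parameter names are hypothetical, and the defaults simply mirror the
98% and 90% figures suggested above.

```python
def route(score, reject_at=0.98, hold_at=0.90):
    """Decide what a list server does with a message given its spam score (0..1)."""
    if score >= reject_at:
        return "reject"   # very likely spam: refuse outright
    if score >= hold_at:
        return "hold"     # suspicious: queue for moderator approval
    return "accept"       # deliver to the list as usual


decisions = [route(s) for s in (0.995, 0.93, 0.40)]
```

While the databases are still being fleshed out, setting reject_at
above 1.0 sends everything suspicious to the moderator instead of
discarding it.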

 -D.			We establised a fine coffee. What everybody can say
 Sun Project, APC/UCCO	TASTY! It's fresh, so-mild, with some special coffee's
 University of Chicago	bitter and sourtaste. "LET'S HAVE SUCH A COFFEE! NOW!"
 dgc@uchicago.edu	Please love CAFE MIAMI. Many thanks.