Pratik Sarkar writes:
Okay, so what should a GSoC student concentrate on for the project?
Writing the proposal!<wink/>
1. A standardized interface (e.g. MILTER, SMTP/LMTP transport)
Very important. In the case of a filter proposal, which one is up to you (both are important to Mailman: milter is more flexible, but it's not available in Exim, so the SMTP/LMTP transport route matters too).
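To make the "transport" flavor concrete, here's a rough sketch of what an LMTP-based filter could look like. This is only an illustration, not Mailman code: it assumes the third-party aiosmtpd package, and looks_like_spam() is a placeholder for whatever external scorer gets plugged in.

    # Sketch only: assumes the third-party aiosmtpd package is installed.
    from aiosmtpd.controller import Controller
    from aiosmtpd.lmtp import LMTP

    def looks_like_spam(raw_bytes):
        # Placeholder for an external scorer (SpamAssassin, SpamBayes, ...).
        return False

    class SpamCheckingHandler:
        async def handle_DATA(self, server, session, envelope):
            # envelope.content is the raw message as bytes.
            if looks_like_spam(envelope.content):
                return '550 5.7.1 Message rejected as spam'
            # A real transport would reinject the clean message toward
            # Mailman's own LMTP runner here.
            return '250 OK'

    class LMTPController(Controller):
        def factory(self):
            # Same controller machinery as SMTP, but speaking LMTP (LHLO).
            return LMTP(self.handler)

    if __name__ == '__main__':
        controller = LMTPController(SpamCheckingHandler(),
                                    hostname='127.0.0.1', port=8024)
        controller.start()
        input('LMTP filter listening on port 8024; press Enter to stop.\n')
        controller.stop()

A milter version would hook the same looks_like_spam() check into a milter library instead of an LMTP server; that's the "more flexible" option, it just doesn't help Exim users.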
2. A handler which delegates to external spam filtering packages
Much preferred to 3 by pretty much all the Mailman developers who have spoken up.
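For what it's worth, the guts of option 2 can be as small as shelling out to spamc. The snippet below is just a sketch of that call ("spamc -c" exits with status 1 when SpamAssassin judges the message to be spam); wrapping it as a Mailman handler and deciding what to do with the verdict (discard, hold, tag) is the actual design work.

    import subprocess

    def is_spam(raw_message_bytes):
        # 'spamc -c' prints "score/threshold" and exits with status 1
        # when SpamAssassin classifies the message as spam, 0 otherwise.
        # Anything other than a clean "spam" verdict is treated as ham,
        # which is the safe default if spamd is unreachable.
        result = subprocess.run(
            ['spamc', '-c'],
            input=raw_message_bytes,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
        )
        return result.returncode == 1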
3. A totally new spam filter
IMHO, unless it includes a facility for sending Black Helicopters to shut down spammers "permanently", don't bother. SpamAssassin and SpamBayes are good, and both can be trained -- it's just that users don't want to go to that much effort. I see very little room for a breakthrough (defined as "clearly going to be at least as good as SA or SB, and near zero effort to train to be better") in this area by any of the students who have proposed one: clearly, none of them could be called "experts" on spam or on classifiers yet. (And that matters: both are pretty big subjects that will take many weeks to learn.)
If someone really wants to do this, start now and come back next year with a very precise proposal of how you're going to construct the classifier and why it will be a breakthrough (as defined above). It's worth doing -- not only will you get the GSoC, but any degree of success will make you a "name" in the field.[1]
4. An interface where users can manually tag "this mail is spam" (for mail which remains unfiltered) to improve the existing spam database.
Define "users". For most of the definitions I can think of, though, it's a TAGUI (They Aren't Gonna Use It), so why bother? One of two exceptions is that I could see an *inverse* to tagging spam, where site owners provide the service of running a trainer over the ham in existing archives. Implementing such a trainer is *very* difficult, however, because it's very likely to result in over-training unless very accurately tuned, and it must account for the bias of having no spam in the training corpus.
The other exception would be implementing tagging as part of a moderation interface. That is going to be an even weirder corpus, since it consists of exactly the messages on the boundary between spam and ham. It could easily end up "thrashing", ie, picking up small, irrelevant differences and making detection of both spam and ham worse. So this would require a lot of empirical evidence to convince people to use it in production. (Ie, hard, boring work actually looking at spam.)
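The kind of empirical evidence I mean is pretty mundane: score a held-out, hand-labelled corpus before and after training on the moderator-tagged mail, and see whether the error counts actually go down. Roughly (classify() here is a stand-in for whatever predicate the filter exposes):

    def count_errors(classify, labelled_messages):
        # labelled_messages is an iterable of (raw_message, is_spam)
        # pairs that the classifier has never been trained on.
        false_positives = false_negatives = 0
        for raw, is_spam in labelled_messages:
            predicted_spam = classify(raw)
            if predicted_spam and not is_spam:
                false_positives += 1   # ham wrongly flagged as spam
            elif is_spam and not predicted_spam:
                false_negatives += 1   # spam that got through
        return false_positives, false_negatives

If training on the boundary messages makes either number worse, that's the "thrashing" showing up in the data.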
Footnotes: [1] http://www.youtube.com/watch?v=MGhEEuE56cY