Pratik Sarkar writes:
Okay, so what should a GSoC student concentrate on for the project?
Writing the proposal!<wink/>
1. A standardized interface (e.g. MILTER, SMTP/LMTP transport)
Very important. In the case of a filter proposal, which one is up to you (both are important to Mailman: milter is more flexible, but it's not available in Exim, so the SMTP/LMTP transport route matters too).
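To make the "transport" flavor concrete, here's a rough sketch of what an LMTP-based filter could look like. This is only an illustration, not Mailman code: it assumes the third-party aiosmtpd package, and looks_like_spam() is a placeholder for whatever external scorer gets plugged in.

    # Sketch only: assumes the third-party aiosmtpd package is installed.
    from aiosmtpd.controller import Controller
    from aiosmtpd.lmtp import LMTP

    def looks_like_spam(raw_bytes):
        # Placeholder for an external scorer (SpamAssassin, SpamBayes, ...).
        return False

    class SpamCheckingHandler:
        async def handle_DATA(self, server, session, envelope):
            # envelope.content is the raw message as bytes.
            if looks_like_spam(envelope.content):
                return '550 5.7.1 Message rejected as spam'
            # A real transport would reinject the clean message toward
            # Mailman's own LMTP runner here.
            return '250 OK'

    class LMTPController(Controller):
        def factory(self):
            # Same controller machinery as SMTP, but speaking LMTP (LHLO).
            return LMTP(self.handler)

    if __name__ == '__main__':
        controller = LMTPController(SpamCheckingHandler(),
                                    hostname='127.0.0.1', port=8024)
        controller.start()
        input('LMTP filter listening on port 8024; press Enter to stop.\n')
        controller.stop()

A milter version would hook the same looks_like_spam() check into a milter library instead of an LMTP server; that's the "more flexible" option, it just doesn't help Exim users.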
2. A handler which delegates to external spam filtering packages
Much preferred to 3 by pretty much all the Mailman developers who have spoken up.
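For what it's worth, the guts of option 2 can be as small as shelling out to spamc. The snippet below is just a sketch of that call ("spamc -c" exits with status 1 when SpamAssassin judges the message to be spam); wrapping it as a Mailman handler and deciding what to do with the verdict (discard, hold, tag) is the actual design work.

    import subprocess

    def is_spam(raw_message_bytes):
        # 'spamc -c' prints "score/threshold" and exits with status 1
        # when SpamAssassin classifies the message as spam, 0 otherwise.
        # Anything other than a clean "spam" verdict is treated as ham,
        # which is the safe default if spamd is unreachable.
        result = subprocess.run(
            ['spamc', '-c'],
            input=raw_message_bytes,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
        )
        return result.returncode == 1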
3. A totally new spam filter
IMHO, unless it includes a facility for sending Black Helicopters to shut down spammers "permanently", don't bother. SpamAssassin and SpamBayes are good, and both can be trained -- it's just that users don't want to go to that much effort. I see very little room for a breakthrough (defined as "clearly going to be at least as good as SA or SB, and near zero effort to train to be better") in this area by any of the students who have proposed one: clearly, none of them could be called "experts" on spam or on classifiers yet. (And that matters: both are pretty big subjects that will take many weeks to learn.)
If someone really wants to do this, start now and come back next year with a very precise proposal of how you're going to construct the classifier and why it will be a breakthrough (as defined above). It's worth doing -- not only will you get the GSoC, but any degree of success will make you a "name" in the field.[1]
4. An interface where users can manually tag "this mail is spam" (for mail which remains unfiltered) to improve the existing spam database.
Define "users". For most of the definitions I can think of, though, it's a TAGUI (They Aren't Gonna Use It), so why bother? One of two exceptions is that I could see an *inverse* to tagging spam, where site owners provide the service of running a trainer over the ham in existing archives. Implementing such a trainer is *very* difficult, however, because it's very likely to result in over-training unless very accurately tuned, and it must account for the bias of having no spam in the training corpus.
The other exception would be implementing tagging as part of a moderation interface. That is going to be an even weirder corpus, since it consists of exactly the messages on the boundary between spam and ham. It could easily end up "thrashing", ie, picking up small, irrelevant differences and making detection of both spam and ham worse. So this would require a lot of empirical evidence to convince people to use it in production. (Ie, hard, boring work actually looking at spam.)
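The kind of empirical evidence I mean is pretty mundane: score a held-out, hand-labelled corpus before and after training on the moderator-tagged mail, and see whether the error counts actually go down. Roughly (classify() here is a stand-in for whatever predicate the filter exposes):

    def count_errors(classify, labelled_messages):
        # labelled_messages is an iterable of (raw_message, is_spam)
        # pairs that the classifier has never been trained on.
        false_positives = false_negatives = 0
        for raw, is_spam in labelled_messages:
            predicted_spam = classify(raw)
            if predicted_spam and not is_spam:
                false_positives += 1   # ham wrongly flagged as spam
            elif is_spam and not predicted_spam:
                false_negatives += 1   # spam that got through
        return false_positives, false_negatives

If training on the boundary messages makes either number worse, that's the "thrashing" showing up in the data.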
Footnotes: [1] http://www.youtube.com/watch?v=MGhEEuE56cY