[Mailman-Developers] GSOC 2013 project discussion

Wed Apr 17 17:32:12 CEST 2013

On 13-04-17 6:56 AM, Avik Pal wrote:
>           Meanwhile it would be much appreciated if someone can direct me to
> a labeled dataset available online.
>
Leaving aside entirely the question of whether we should (or will) 
support any project that requires learning on this scale, as a former 
anti-spam researcher, I can at least answer this question.

Unfortunately, the answer is largely "good luck with that" -- good 
labelled email data is surprisingly hard to come by, and that challenge 
is one of the reasons I stopped doing research in that area.

When I was doing anti-spam research, the only viable public classified 
ham/spam set was the SpamAssassin one.  I don't believe it's been 
maintained with modern messages and at this point it may be useless.

Shortly after I left the field, people started using the Enron data set, 
which is pretty well classified by now, but again, is pretty long in the 
tooth.

Given that you're going to want to be classifying mailing list data, you 
may have to produce some synthetic data sets using information from 
publicly available mailing lists (e.g. the public archives of 
mailman-developers are available) and combining them with other data 
sources (e.g. publicly available collections of spam).  This won't have 
a whole lot of interesting sub-labels (some lists will have more than 
others, depending on their use of dlists/topics/pre-classification by 
the sender) and a synthetic set is generally regarded as a poor 
information source for reproducible results, but it could be enough in a 
pinch given that you're adding a feature rather than publishing 
scientific work.
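
As a rough illustration of the combining step, here is a minimal Python
sketch using the standard-library mailbox module.  It assumes both the
list archive and the spam collection are available as mbox files; the
function names and the "ham"/"spam" labels are made up for illustration
and are not anything Mailman provides.  Treating every archived list post
as ham is itself an assumption, since archives can contain spam that
slipped through.

```python
import mailbox
import random


def extract_text(msg):
    """Concatenate the text/plain parts of a message, skipping other parts."""
    parts = []
    for part in msg.walk():
        if part.get_content_type() != "text/plain":
            continue
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        charset = part.get_content_charset() or "utf-8"
        try:
            parts.append(payload.decode(charset, errors="replace"))
        except LookupError:
            # Unknown charset name in the message; fall back to UTF-8.
            parts.append(payload.decode("utf-8", errors="replace"))
    return "\n".join(parts)


def load_mbox(path, label):
    """Return (label, subject, body) tuples for every message in an mbox file."""
    rows = []
    box = mailbox.mbox(path, create=False)
    try:
        for msg in box:
            subject = str(msg.get("Subject", ""))
            rows.append((label, subject, extract_text(msg)))
    finally:
        box.close()
    return rows


def build_labelled_set(ham_paths, spam_paths, seed=0):
    """Merge list archives (assumed ham) with spam collections and shuffle.

    The fixed seed makes the ordering repeatable between runs.
    """
    rows = []
    for path in ham_paths:
        rows.extend(load_mbox(path, "ham"))
    for path in spam_paths:
        rows.extend(load_mbox(path, "spam"))
    random.Random(seed).shuffle(rows)
    return rows
```

Something like build_labelled_set(["list-archive.mbox"], ["spam.mbox"])
(both file names hypothetical) would then give you one shuffled list of
labelled messages to split into training and test portions.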

Note that the GSoC timeline doesn't allow time for finding and creating 
such a set, so if you're going to use one, you should determine in 
advance what you'll be using and be able to provide a link to the 
completely-ready-for-GSoC set in your proposal.

  Terri