[Mailman-Developers] GSOC 2013 project discussion

Avik Pal avikpal.me at gmail.com
Wed Apr 17 18:02:03 CEST 2013


Thanks a lot, Terri. I think I will go with the Enron email dataset,
cross-validated against publicly available legitimate mailing list mail
and spam. Hopefully Python's regular expressions will help me a lot in
building the synthetic set.
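
As a rough sketch of the idea (not a finished design), the synthetic set
could be built by labelling list mail as ham, labelling messages from a
spam collection as spam, and using a regular expression to strip list tags
and reply markers from subjects so they do not leak the label. The names
normalize_subject, to_example and build_synthetic_set are hypothetical, and
the subject pattern is only an assumption about what list software
prepends.

```python
import random
import re
from email import message_from_string

# Assumed pattern: leading "Re:"/"Fw:"/"Fwd:" markers and "[list-name]" tags
# that mailing list software prepends to subjects.
_SUBJECT_PREFIX = re.compile(
    r"^(?:\s*(?:re|fwd?):\s*|\s*\[[^\]]+\]\s*)+", re.IGNORECASE
)


def normalize_subject(subject):
    """Strip list tags and reply/forward markers from a subject line."""
    return _SUBJECT_PREFIX.sub("", subject or "").strip()


def to_example(raw_message, label):
    """Turn one raw RFC 822 message string into a labelled record."""
    msg = message_from_string(raw_message)
    payload = msg.get_payload()
    # Multipart messages return a list of parts; this sketch skips them.
    body = payload if isinstance(payload, str) else ""
    return {
        "subject": normalize_subject(msg["Subject"]),
        "body": body.strip(),
        "label": label,
    }


def build_synthetic_set(ham_raw, spam_raw, seed=0):
    """Combine list mail (ham) and spam into one shuffled, labelled set."""
    examples = [to_example(m, "ham") for m in ham_raw]
    examples += [to_example(m, "spam") for m in spam_raw]
    # A fixed seed keeps the shuffle repeatable between runs.
    random.Random(seed).shuffle(examples)
    return examples
```

The raw messages could come from the public list archives and a spam
collection, for example by iterating over mbox files with the standard
library's mailbox module.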

Avik Pal
Bengal Engineering & Science University, Shibpur
github:https://github.com/avikpal
IRC:- irc://freenode/avikp,isnick
twitter:-https://twitter.com/avikpalme





On 17 April 2013 21:02, Terri Oda <terri at zone12.com> wrote:

>
> On 13-04-17 6:56 AM, Avik Pal wrote:
>
>> Meanwhile, it would be much appreciated if someone could direct me to
>> a labeled dataset available online.
>
> Leaving aside entirely the question of whether we should (or will)
> support any project that requires learning on this scale, as a former
> anti-spam researcher, I can at least answer this question.
>
> Unfortunately, the answer is largely "good luck with that" -- good
> labelled email data is surprisingly hard to come by, and that challenge is
> one of the reasons I stopped doing research in that area.
>
> When I was doing anti-spam research, the only viable public classified
> ham/spam set was the SpamAssassin one.  I don't believe it's been
> maintained with modern messages and at this point it may be useless.
>
> Shortly after I left the field, people started using the Enron data set,
> which is pretty well classified by now, but again, is pretty long in the
> tooth.
>
> Given that you're going to want to be classifying mailing list data, you
> may have to produce some synthetic data sets using information from
> publicly available mailing lists (e.g. the public archives of
> mailman-developers are available) and combining them with other data
> sources (e.g. publicly available collections of spam).  This won't have a
> whole lot of interesting sub-labels (some lists will have more than others,
> depending on their use of dlists/topics/pre-classification by the
> sender) and a synthetic set is generally regarded as a poor information
> source for reproducible results, but it could be enough in a pinch given
> that you're adding a feature rather than publishing scientific work.
>
> Note that the GSoC timeline doesn't allow time for finding and creating
> such a set, so if you're going to use one, you should determine in advance
> what you'll be using and be able to provide a link to the
> completely-ready-for-gsoc set in your proposal.
>
>  Terri
>
>
>

