[Mailman-Developers] GSOC 2013 project discussion

Wed Apr 17 20:28:41 CEST 2013

I'm glad you're somewhat aware of the issues.  I frequently encounter 
folk who aren't aware of the issues in machine learning, so your "don't 
lose hope" email set off all kinds of warning bells in my head.

Going back to GSoC-specific stuff:

- Enron is a very old data set
- If you're going to use it, you need to be prepared to defend that 
choice.  I'm not sure it's a choice that can be defended at all, knowing 
the field.  It's probably not only an old data set, but a completely 
counter-productive one given the space in which Mailman operates.

So here are some things to think about:

(1) I want some justification of how this is going to be relevant to the 
problem you're trying to solve, which is "helping classify spam emails 
sent to a mailing list that the MTA was unable to classify."

(2) Many existing classifiers that run at the MTA level have already 
used the Enron data set, so chances are any features you learn will 
already have been incorporated.  I have severe concerns that any 
new features you learn will result in over-fitting.  How can you believe 
that yet another classifier trained on the same data will be worth the 
processing overhead and resulting delays in mail delivery when it seems 
likely that any improvement will be incremental at best?

(3) Enron is not going to help you make use of any list-specific 
features.  How can you use this data set to produce something that is 
useful to Mailman, going beyond what any MTA-level spam filter can do?  
(Note that we've been telling people to do spam filtering at the MTA 
level for years and years and years; justifying this is not going to be 
an easy task.)

(4) If you're going to do cross-validation with other data to make 
claims that the final classifier will be relevant to list data, how is 
that data going to be obtained, processed, and used?

(5) Unless you've got a plan for making extensive use of the fact that 
you're classifying mailing list data and not general email, you're 
pretty much wasting our time since we are only interested in projects 
relevant to Mailman.
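
To make "list-specific features" in points (3) and (5) a bit more concrete, here is a minimal sketch in Python of the kind of signal an MTA-level filter would not normally have.  Everything in it is an assumption for illustration only: the function name list_features, the three feature names, and the idea that the subscriber set and the Message-IDs of earlier list posts are available as plain Python sets.  It is not Mailman's API and not a proposed design.

```python
from email import message_from_string
from email.utils import getaddresses, parseaddr


def list_features(raw_message, list_address, subscribers, known_message_ids):
    """Return hypothetical list-specific features for one raw message.

    subscribers: set of lowercased subscriber addresses (assumed available).
    known_message_ids: set of Message-ID strings already posted to the list.
    """
    msg = message_from_string(raw_message)

    # parseaddr returns (display_name, address); only the address is used.
    sender = parseaddr(msg.get("From", ""))[1].lower()

    # Collect every address named in To and Cc.
    recipients = {
        addr.lower()
        for _, addr in getaddresses(msg.get_all("To", []) + msg.get_all("Cc", []))
    }

    in_reply_to = (msg.get("In-Reply-To") or "").strip()

    return {
        "sender_is_subscriber": sender in subscribers,
        "addressed_to_list": list_address.lower() in recipients,
        "replies_to_list_thread": in_reply_to in known_message_ids,
    }


# Example message; all addresses and IDs below are made up.
raw = (
    "From: Alice <alice@example.org>\n"
    "To: dev-list@example.org\n"
    "In-Reply-To: <thread-1@example.org>\n"
    "Subject: Re: release plan\n"
    "\n"
    "Sounds good to me.\n"
)
features = list_features(
    raw,
    "dev-list@example.org",
    {"alice@example.org"},
    {"<thread-1@example.org>"},
)
```

None of these features can be learned from Enron, since that corpus has no subscriber list or list threading to compare against, which is the gap a proposal would need to address.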

To be completely honest, I'm still seeing "student project for data 
mining class" level thinking here, and that's not going to be good 
enough for us.  Especially considering that you didn't even know about 
the most common data sets for this problem, I'm concerned that you 
haven't yet reached the skill and experience necessary for us to 
seriously consider a classifier as even a small part of a GSoC project.  
We have to give priority to students who we are convinced can finish 
their projects, and it seems like there are too many chances of you 
getting stuck on finding data and using it correctly on a problem that 
is actually meaningful to Mailman and not just a general classification 
task.

  Terri

On 13-04-17 10:51 AM, Avik Pal wrote:
>   Yes, I get your point, but these are part of any machine learning 
> project, and feature extraction has to be done with the synthetic 
> data set in mind.
>
>
> On 17 April 2013 22:05, Terri Oda <terri at zone12.com> wrote:
>
>
>
>     Finding sources of spam (like that one) isn't that hard; it's
>     finding sources of legit email combined with spam and classified
>     and processed in the same way that's challenging.  As I said, you
>     can combine a spam source like this with a publicly available
>     mailing list to make a synthetic set, but scientifically speaking,
>     those aren't really preferred ways to handle data because they
>     come from multiple sources.
>
>
>     Well, in this regard the only thing I can do is keep looking.  I am 
> also aware that data coming from different sources can be skewed, but 
> these things are never perfect and there is always scope for 
> improvement.  I think our aim should be to implement a rudimentary 
> classifier with fairly good performance to start with.