Mailman 3 Re: [Mailman-Developers] GSOC 2013 project discussion - Mailman-Developers

April 17, 2013

I'm glad you're somewhat aware of the issues.  I frequently encounter
folk who aren't aware of the issues in machine learning, so your "don't
lose hope" email set off all kinds of warning bells in my head.

Going back to GSoC-specific stuff:

Enron is a very old data set
If you're going to use it, you need to be prepared to defend that
choice.  I'm not sure it's a choice that can be defended at all, knowing
the field.  It's probably not only an old data set, but a completely
counter-productive one given the space in which Mailman operates.

So here are some things to think about:
(1) I want some justification of how this is going to be relevant to the
problem you're trying to solve, which is "helping classify spam emails
sent to a mailing list that the MTA was unable to classify"
(2) Many existing classifiers that run at the MTA level have already
used the Enron data set, so chances are any features you learn will
already have been incorporated.  I have severe concerns that any
new features you learn will result in over-fitting.  How can you believe
that yet another classifier trained on the same data will be worth the
processing overhead and resulting delays in mail delivery when it seems
likely that any improvement will be incremental at best?
(3) Enron is not going to help you make use of any list-specific
features.  How can you use this data set to produce something that is
useful to Mailman, going beyond what any MTA-level spam filter can do?

(Note that we've been telling people to do spam filtering at the MTA
level for years and years and years; justifying this is not going to be
an easy task)
(4) If you're going to do cross-validation with other data to make
claims that the final classifier will be relevant to list data, how is
that data going to be obtained, processed, and used?
(5) Unless you've got a plan for making extensive use of the fact that
you're classifying mailing list data and not general email, you're
pretty much wasting our time since we are only interested in projects
relevant to Mailman.

To be completely honest, I'm still seeing "student project for data
mining class" level thinking here, and that's not going to be good
enough for us.  Especially considering that you didn't even know about
the most common data sets for this problem, I'm concerned that you
haven't yet reached the skill and experience necessary for us to
seriously consider a classifier as even a small part of a GSoC project.

We have to give priority to students who we are convinced can finish
their projects, and it seems like there are too many chances of you
getting stuck on finding data and using it correctly on a problem that
is actually meaningful to Mailman and not just a general classification
task.

Terri

On 13-04-17 10:51 AM, Avik Pal wrote:
...
Yes, I get your point, but these are part of any machine learning
project, and feature extraction has to be done considering the
synthetic data set.
On 17 April 2013 22:05, Terri Oda <terri@zone12.com> wrote:
Finding sources of spam (like that one) isn't that hard; it's
finding sources of legit email combined with spam and classified
and processed in the same way that's challenging.  As I said, you
can combine a spam source like this with a publicly available
mailing list to make a synthetic set, but scientifically speaking,
those aren't really preferred ways to handle data because they
come from multiple sources.

Well, in this regard the only thing I can do is keep looking. I am
also aware that data coming from different sources can be skewed, but
these things are never perfect and there is always scope for
improvement. I think that our aim should be to implement a rudimentary
classifier with fairly good performance to start with.
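
The multiple-source concern raised in the quoted paragraph above can be made concrete with a small check. What follows is only a minimal sketch under stated assumptions: the four messages are made-up stand-ins rather than real data, and `build_synthetic_set` and `header_label_agreement` are hypothetical helper names, not part of Mailman or of any spam corpus tooling. It labels messages purely by which source they came from, then measures how well a single source artifact (here the `List-Id` header) lines up with the label.

```python
from email import message_from_string

# Toy stand-ins, NOT real data: two messages from a hypothetical public list
# archive (treated as legitimate) and two from a hypothetical spam corpus.
LIST_ARCHIVE = [
    "List-Id: <dev.example.org>\nSubject: patch review\n\nLooks fine to me.",
    "List-Id: <dev.example.org>\nSubject: release plan\n\nFreeze is on Friday.",
]
SPAM_CORPUS = [
    "Subject: cheap pills\n\nBuy now.",
    "Subject: you won\n\nClaim your prize.",
]


def build_synthetic_set(ham_raw, spam_raw):
    """Label every message by its source: 0 for the list archive, 1 for spam."""
    ham = [(message_from_string(raw), 0) for raw in ham_raw]
    spam = [(message_from_string(raw), 1) for raw in spam_raw]
    return ham + spam


def header_label_agreement(dataset, header):
    """Fraction of messages where 'header is missing' matches the spam label."""
    hits = sum((msg[header] is None) == (label == 1) for msg, label in dataset)
    return hits / len(dataset)


dataset = build_synthetic_set(LIST_ARCHIVE, SPAM_CORPUS)
print(header_label_agreement(dataset, "List-Id"))
print(header_label_agreement(dataset, "Subject"))
```

On this toy data the `List-Id` check agrees with the label for every message, while `Subject` does so for only half of them. A classifier trained on such a set could score well by learning which source a message came from rather than whether it is spam, so any header or formatting difference between the sources would need to be normalized or excluded before feature extraction.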
