Re: [Mailman-Developers] GSOC 2013 project discussion
I'm glad you're somewhat aware of the issues. I frequently encounter folk who aren't aware of the issues in machine learning, so your "don't lose hope" email set off all kinds of warning bells in my head.
Going back to GSoC-specific stuff:
- Enron is a very old data set
- If you're going to use it, you need to be prepared to defend that choice. I'm not sure it's a choice that can be defended at all, knowing the field. It's probably not only an old data set, but a completely counter-productive one given the space in which Mailman operates.
So here are some things to think about:
(1) I want some justification of how this is going to be relevant to the problem you're trying to solve, which is "helping classify spam emails sent to a mailing list that the MTA was unable to classify"
(2) Many existing classifiers that run at the MTA level have already used the Enron data set, so chances are any features you learn will already have been incorporated. I have severe concerns that any new features you learn will result in over-fitting. How can you believe that yet another classifier trained on the same data will be worth the processing overhead and resulting delays in mail delivery when it seems likely that any improvement will be incremental at best?
(3) Enron is not going to help you make use of any list-specific
features. How can you use this data set to produce something that is
useful to Mailman, going beyond what any MTA-level spam filter can do?
(Note that we've been telling people to do spam filtering at the MTA
level for years and years and years; justifying this is not going to be
an easy task)
(4) If you're going to do cross-validation with other data to make claims that the final classifier will be relevant to list data, how is that data going to be obtained, processed, and used?
(5) Unless you've got a plan for making extensive use of the fact that you're classifying mailing list data and not general email, you're pretty much wasting our time since we are only interested in projects relevant to Mailman.
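To make points (3) and (5) concrete, here is a rough sketch of what "list-specific features" could look like. The three features below, and the names list_features, subscribers, and known_message_ids, are illustrative assumptions only; nothing in this thread or in Mailman defines them. The point is just that a list knows things (its membership, its own threads) that a generic MTA-level filter trained on something like Enron does not.

```python
from email import message_from_string
from email.utils import getaddresses, parseaddr


def list_features(raw_message, list_address, subscribers, known_message_ids):
    """Return a few hypothetical list-aware features for one message.

    subscribers: set of lowercase subscriber addresses (assumed available).
    known_message_ids: set of Message-ID strings already seen on the list.
    """
    msg = message_from_string(raw_message)

    # Address in the From header, without the display name.
    _, sender = parseaddr(msg.get("From", ""))

    # Every address in To and Cc, lowercased for comparison.
    recipient_headers = msg.get_all("To", []) + msg.get_all("Cc", [])
    recipients = {addr.lower() for _, addr in getaddresses(recipient_headers)}

    in_reply_to = msg.get("In-Reply-To", "").strip()

    return {
        # Is the poster someone the list already knows?
        "sender_is_subscriber": sender.lower() in subscribers,
        # Was the list addressed directly rather than reached some other way?
        "list_explicitly_addressed": list_address.lower() in recipients,
        # Does the message claim to answer something posted to this list?
        "replies_to_list_thread": in_reply_to in known_message_ids,
    }
```

Whether features like these carry enough signal to justify a second classifier after the MTA is exactly the open question in (1) and (2); the sketch only shows that they cannot be learned from a non-list corpus.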
To be completely honest, I'm still seeing "student project for data
mining class" level thinking here, and that's not going to be good
enough for us. Especially considering that you didn't even know about
the most common data sets for this problem, I'm concerned that you
haven't yet reached the skill and experience necessary for us to
seriously consider a classifier as even a small part of a GSoC project.
We have to give priority to students who we are convinced can finish
their projects, and it seems like there are too many chances of you
getting stuck on finding data and using it correctly on a problem that
is actually meaningful to Mailman and not just a general classification
task.
Terri
On 13-04-17 10:51 AM, Avik Pal wrote:
Yes, I get your point, but these issues are part of any machine learning project, and feature extraction has to be done with the synthetic data set in mind.
On 17 April 2013 22:05, Terri Oda <terri@zone12.com> wrote:
Finding sources of spam (like that one) isn't that hard; it's finding sources of legit email combined with spam and classified and processed in the same way that's challenging. As I said, you can combine a spam source like this with a publicly available mailing list to make a synthetic set, but scientifically speaking, those aren't really preferred ways to handle data because they come from multiple sources.
Well, in this regard the only thing I can do is keep looking. I am also aware that data coming from different sources can be skewed, but these things are never perfect and there is always scope for betterment. I think our aim should be to implement a rudimentary classifier with fairly good performance to start with.
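On the multiple-sources concern quoted above: one quick sanity check on a synthetic set is to see whether some header alone separates the two sources, since a classifier can then learn "which archive did this come from" instead of "is this spam". The following is a minimal sketch under that assumption; source_leak_headers and the 0.9 threshold are made up for illustration, and it only looks at header presence, not header values or body text.

```python
from collections import Counter
from email import message_from_string


def source_leak_headers(spam_raw, ham_raw, threshold=0.9):
    """Find headers whose presence rate differs sharply between the two sets.

    spam_raw, ham_raw: non-empty lists of raw message strings, each list
    drawn from a different source. Returns {header_name: rate_gap}.
    """

    def header_rates(raw_messages):
        counts = Counter()
        for raw in raw_messages:
            msg = message_from_string(raw)
            # Count each header at most once per message.
            counts.update({name.lower() for name in msg.keys()})
        return {name: n / len(raw_messages) for name, n in counts.items()}

    spam_rates = header_rates(spam_raw)
    ham_rates = header_rates(ham_raw)

    leaks = {}
    for name in set(spam_rates) | set(ham_rates):
        gap = abs(spam_rates.get(name, 0.0) - ham_rates.get(name, 0.0))
        if gap >= threshold:
            leaks[name] = gap
    return leaks
```

If the ham comes from a mailing list archive and the spam from a standalone corpus, headers added by the list software are the obvious candidates to show up here, and they would need to be stripped or normalized before any reported accuracy means much.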