Avik Pal writes:
Meanwhile, it would be much appreciated if someone could direct me to a labeled dataset available online.
By "labelled" you mean pre-classified into spam vs ham? I see you already found one, but you could also check the SpamBayes and SpamAssassin distributions.
Here I have a suggestion: after submission, whenever an email is classified as spam, we store it in a separate archive, and at the end of the day we send the subscriber a mail saying "this is the digest of all the mails that Mailman thinks are spam". The subscriber can go there, view them, and mark them as not spam.
I suggest that you present this as an option for users who want to tune the filters, and as something that can be used pre-release to develop the initial parameters for the distributed classifier. Although Bayesian classifiers do offer the option to train or tune your personal classifier on a local corpus, most users just stick with the distribution parameters plus self-training. It's pretty effective (surprisingly so to me). I guess the logic is that spammers aren't terribly creative.
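To make the "distribution parameters plus self-training" idea concrete, here is a rough sketch of a token-based naive Bayes filter in which a user's "not spam" mark undoes an earlier self-training step. This is not SpamBayes or Mailman code; the class TinyBayes and its methods train, untrain, correct, and spam_probability are hypothetical names invented for illustration, and the add-one smoothing is just one simple choice.

```python
import math
import re
from collections import Counter


class TinyBayes:
    """Toy naive Bayes spam/ham filter (hypothetical, illustration only)."""

    def __init__(self):
        # Per-label count of messages in which each token appeared.
        self.tokens = {"spam": Counter(), "ham": Counter()}
        # Per-label count of trained messages.
        self.messages = {"spam": 0, "ham": 0}

    @staticmethod
    def tokenize(text):
        # Each distinct lowercase word counts once per message.
        return set(re.findall(r"[a-z0-9']+", text.lower()))

    def train(self, text, label):
        self.tokens[label].update(self.tokenize(text))
        self.messages[label] += 1

    def untrain(self, text, label):
        # Only valid for a message previously trained under this label.
        self.tokens[label].subtract(self.tokenize(text))
        self.messages[label] -= 1

    def correct(self, text, wrong, right):
        # What a "not spam" click in the digest would trigger.
        self.untrain(text, wrong)
        self.train(text, right)

    def spam_probability(self, text):
        # Sum log-odds over tokens, with add-one smoothing so that
        # unseen tokens never produce a zero probability.
        n_spam, n_ham = self.messages["spam"], self.messages["ham"]
        log_odds = math.log((n_spam + 1) / (n_ham + 1))
        for tok in self.tokenize(text):
            p_spam = (self.tokens["spam"][tok] + 1) / (n_spam + 2)
            p_ham = (self.tokens["ham"][tok] + 1) / (n_ham + 2)
            log_odds += math.log(p_spam / p_ham)
        return 1 / (1 + math.exp(-log_odds))
```

In this sketch the shipped parameters would simply be the counts from the pre-release corpus; self-training calls train() with the filter's own verdict, and the digest's "not spam" link calls correct(msg, "spam", "ham").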
Emails that remain marked as spam will be dropped after a month.
Let's think carefully about that. Everybody deletes the spam; that's why you started by asking for a labelled dataset, because nobody keeps one around. Somebody really ought to do the public service of collecting a corpus. Of course, if you do arrange to keep it around, it's going to need to be an option that sites and list owners can disable.
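As a sketch of what "an option that sites and list owners can disable" might look like, assuming a per-list policy object: everything here (HeldMessage, RetentionPolicy, sweep, and the 30-day default standing in for "a month") is hypothetical and not an existing Mailman interface.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class HeldMessage:
    msgid: str
    held_at: datetime
    marked_not_spam: bool = False  # set when a subscriber rescues it


@dataclass
class RetentionPolicy:
    # Hypothetical switch: a site or list owner sets this to False
    # to refuse to keep suspected spam around at all.
    keep_spam: bool = True
    # "A month" approximated as 30 days (an assumption).
    retention: timedelta = timedelta(days=30)


def sweep(held, policy, now):
    """Split held messages into (kept, released, dropped)."""
    kept, released, dropped = [], [], []
    for msg in held:
        if msg.marked_not_spam:
            # Rescued by a subscriber: deliver instead of deleting.
            released.append(msg)
        elif not policy.keep_spam or now - msg.held_at > policy.retention:
            # Retention disabled, or the message has aged out.
            dropped.append(msg)
        else:
            kept.append(msg)
    return kept, released, dropped
```

Running sweep() once a day alongside the digest would keep the archive bounded, and with keep_spam set to False nothing is retained past the next sweep.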