New subject: GSOC 2013 project discussion

April 17, 2013


Avik Pal writes:
...
> Meanwhile, it would be much appreciated if someone can direct me to
> a labeled dataset available online.
By "labelled" you mean pre-classified into spam vs ham?  I see you
already found one, but you could also check the SpamBayes and
SpamAssassin distributions.
...
> Here I have a suggestion: after submission, whenever an email is
> classified as spam, we store it in a separate archive, and at the
> end of the day send subscribers a mail saying "this is the digest of
> all the mails that Mailman thinks are spam". The subscriber may go
> there to view them and can also mark them as not spam,
I suggest that you present this as an option for users who want to
tune the filters, and as something that can be used pre-release to
develop the initial parameters for the distributed classifier.
Although Bayesian classifiers do offer the option to train or tune
your personal classifier on a local corpus, most users just stick with
the distribution parameters plus self-training.  It's pretty effective
(surprisingly so to me).  I guess the logic is that spammers aren't
terribly creative.
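
To make "distribution parameters plus self-training" concrete, here is a
rough sketch of a toy naive Bayes filter. It is not SpamBayes or anything
in Mailman; the class name TinyBayes, its methods, and the 0.95 threshold
are all hypothetical, chosen only for illustration. The initial train()
calls stand in for the distributed parameters, and classify() then feeds
confident verdicts back into the counts.

```python
import math
import re
from collections import Counter


class TinyBayes:
    """Toy naive Bayes spam/ham filter (hypothetical, illustration only)."""

    def __init__(self):
        # Per-label count of messages containing each token.
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    @staticmethod
    def tokens(text):
        # Use the set of distinct lowercase words as features.
        return set(re.findall(r"[a-z0-9$']+", text.lower()))

    def train(self, text, label):
        self.messages[label] += 1
        self.counts[label].update(self.tokens(text))

    def spam_probability(self, text):
        n_spam, n_ham = self.messages["spam"], self.messages["ham"]
        # Work in log-odds with add-one smoothing to avoid zero counts.
        log_odds = math.log((n_spam + 1) / (n_ham + 1))
        for tok in self.tokens(text):
            p_spam = (self.counts["spam"][tok] + 1) / (n_spam + 2)
            p_ham = (self.counts["ham"][tok] + 1) / (n_ham + 2)
            log_odds += math.log(p_spam / p_ham)
        # Clamp so that exp() cannot overflow on long messages.
        log_odds = max(-50.0, min(50.0, log_odds))
        return 1.0 / (1.0 + math.exp(-log_odds))

    def classify(self, text, self_train=True, sure=0.95):
        p = self.spam_probability(text)
        # Self-training: learn only from verdicts the filter is sure of.
        if self_train and p >= sure:
            self.train(text, "spam")
        elif self_train and p <= 1.0 - sure:
            self.train(text, "ham")
        return p
```

A site would ship the counts produced by the initial training and let
classify() keep them current; the per-user "mark as not spam" step then
becomes an optional extra call to train(text, "ham").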
...
> Emails which stay as spam will be dropped after a month
Let's think carefully about that.  Everybody deletes the spam; that's
why you started by asking for a labelled dataset, because nobody keeps
one around.  Somebody really ought to do the public service of
collecting a corpus.  Of course, if you do arrange to keep it around,
it's going to need to be an option that sites and list owners can
disable.
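
As a sketch of how such a switch might look, assuming a held-spam archive
of (message id, received time, rescued flag) tuples: the names
SpamRetention and messages_to_drop are hypothetical and do not correspond
to any existing Mailman setting. Retention defaults to off, and the
one-month expiry only applies where a site or list owner has turned it on.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class SpamRetention:
    """Hypothetical per-list policy for messages classified as spam."""
    keep_spam: bool = False                  # off unless the owner opts in
    max_age: timedelta = timedelta(days=30)  # "dropped after a month"


def messages_to_drop(archive, policy, now):
    """Return ids of held spam that should be deleted.

    archive is a list of (message_id, received_at, rescued) tuples, where
    rescued means a subscriber marked the message as not spam.
    """
    drop = []
    for message_id, received_at, rescued in archive:
        if rescued:
            continue  # never delete something a human rescued
        if not policy.keep_spam or now - received_at > policy.max_age:
            drop.append(message_id)
    return drop
```

With keep_spam left at False every unrescued message is dropped at once,
which matches what people do by hand today; setting it to True is what
would let a corpus accumulate.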

Re: [Mailman-Developers] GSOC 2013 project discussion

Stephen J. Turnbull

Avik Pal