[Mailman-Developers] GSOC 2013 project discussion
Terri Oda
terri at zone12.com
Wed Apr 17 18:35:10 CEST 2013
On 13-04-17 10:10 AM, Avik Pal wrote:
> Don't lose hope Terri, after digging for a couple of hours came across
> this and its pretty much updated. http://untroubled.org/spam/
Finding sources of spam (like that one) isn't that hard; it's finding
sources of legit email combined with spam and classified and processed
in the same way that's challenging. As I said, you can combine a spam
source like this with a publicly available mailing list to make a
synthetic set, but scientifically speaking, that isn't really a
preferred way to handle data because the two halves come from different
sources.
The problem is that when you have multiple sources it sometimes becomes
too easy for a classifier to classify on features that tell the sources
apart but won't be useful on future mail. For example, one might
classify on the fact that the list address won't appear in any of the
To: or Cc: lines in the spam data because it comes from a different
source, the fact that many of the spams will be from different time
periods, the fact that the spam data is anonymized differently from any
list data you might have, etc. You will wind up doing a lot of work to
normalize the data sets to keep classifiers from learning these
artifacts (and we're talking weeks of really boring work here,
potentially, that you need to start Right Now if you're going to be
using such a set), and you run the risk of missing out on features that
would have been useful in a single-source set but have been completely
obliterated in the synthetic one.
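
To make the To:/Cc: example concrete, here is a minimal Python sketch of
a sanity check you could run on a synthetic set before training
anything. It measures how well one trivial, source-revealing feature
separates the two halves on its own. The list address
`list@example.org`, the function names, and the toy messages are all
made up for illustration; only the standard library `email` module is
used.

```python
from email import message_from_string
from email.utils import getaddresses

# Placeholder address, not a real list; substitute your own.
LIST_ADDRESS = "list@example.org"


def addressed_to_list(raw_message, list_address=LIST_ADDRESS):
    """True if the list address appears in any To: or Cc: header."""
    msg = message_from_string(raw_message)
    recipients = msg.get_all("To", []) + msg.get_all("Cc", [])
    addresses = {addr.lower() for _name, addr in getaddresses(recipients)}
    return list_address.lower() in addresses


def leak_accuracy(ham, spam, feature=addressed_to_list):
    """Accuracy of the rule 'legit if feature is true, spam otherwise'.

    A value near 1.0 means this single feature already separates the
    two sources, so a classifier could learn it instead of anything
    about the content of the mail.
    """
    total = len(ham) + len(spam)
    if total == 0:
        raise ValueError("need at least one message")
    correct = sum(1 for m in ham if feature(m))
    correct += sum(1 for m in spam if not feature(m))
    return correct / total


# Toy data standing in for a list archive and an external spam corpus.
ham = [
    "From: a@example.com\nTo: list@example.org\nSubject: patch\n\nhi\n",
    "From: b@example.com\nTo: c@example.com\nCc: List@Example.org\n\nhi\n",
]
spam = [
    "From: x@example.net\nTo: victim@example.net\nSubject: buy\n\nspam\n",
    "From: y@example.net\nTo: other@example.net\nSubject: win\n\nspam\n",
]

print(leak_accuracy(ham, spam))
```

If a check like this scores close to 1.0 on your real data, that header
needs to be normalized (or dropped) before the set is of much use, and
the same kind of check would be needed for dates, anonymization
patterns, and so on.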
Terri