[Mailman-Developers] GSOC 2013 project discussion

Wed Apr 17 18:35:10 CEST 2013

On 13-04-17 10:10 AM, Avik Pal wrote:
> Don't lose hope Terri, after digging for a couple of hours I came across 
> this and it's pretty much up to date. http://untroubled.org/spam/

Finding sources of spam (like that one) isn't that hard; what's 
challenging is finding legit email and spam together, classified and 
processed in the same way.  As I said, you can combine a spam source 
like this with a publicly available mailing list to make a synthetic 
set, but scientifically speaking, that isn't really a preferred way to 
handle data because the two halves come from different sources.

The problem is that with multiple sources it becomes too easy for a 
classifier to learn features that separate the sources from each other 
rather than spam from legit mail, and those won't be useful in the 
future.  For example, it might classify on the fact that the list 
address won't appear in any of the To: or Cc: lines in the spam data 
(because that data comes from a different source), the fact that many of 
the spams will be from different time periods, the fact that the spam 
data is anonymized differently from any list data you might have, etc.  
You will wind up doing a lot of work to normalize the data sets so the 
classifier can't rely on these artifacts (and we're talking weeks of 
really boring work here, potentially, that you need to start Right Now 
if you're going to be using such a set).  You also run the risk of 
missing out on features that would have been useful in a single-source 
set but have been completely obliterated in the synthetic data set.
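
To make the To:/Cc: example concrete, here is a rough sketch of a sanity 
check you could run on a synthetic set before training anything.  It 
measures how well the single rule "spam if the list address is missing 
from To: and Cc:" does on its own.  The list address, the sample 
messages, and the function names are all made up for illustration; only 
the Python standard library email module is assumed.

```python
from email import message_from_string
from email.utils import getaddresses

# Hypothetical list address, used only for this illustration.
LIST_ADDRESS = "list@example.org"


def mentions_list(raw_message, list_address=LIST_ADDRESS):
    """Return True if the list address appears in a To: or Cc: header."""
    msg = message_from_string(raw_message)
    recipients = msg.get_all("To", []) + msg.get_all("Cc", [])
    addresses = {addr.lower() for _name, addr in getaddresses(recipients)}
    return list_address.lower() in addresses


def leakage_accuracy(labelled_messages, list_address=LIST_ADDRESS):
    """Accuracy of the rule 'spam iff the list address is absent'.

    labelled_messages is a list of (raw_message, is_spam) pairs.
    """
    if not labelled_messages:
        raise ValueError("need at least one labelled message")
    correct = 0
    for raw, is_spam in labelled_messages:
        predicted_spam = not mentions_list(raw, list_address)
        if predicted_spam == is_spam:
            correct += 1
    return correct / len(labelled_messages)


# Two made-up messages standing in for the two sources.
HAM = (
    "From: alice@example.com\n"
    "To: list@example.org\n"
    "Subject: patch review\n"
    "\n"
    "Looks good.\n"
)
SPAM = (
    "From: x@example.net\n"
    "To: victim@example.com\n"
    "Subject: cheap pills\n"
    "\n"
    "Buy now.\n"
)
sample = [(HAM, False), (SPAM, True)]

if __name__ == "__main__":
    print(leakage_accuracy(sample))  # prints 1.0 for this toy sample
```

If a rule this trivial scores close to 1.0 on the real combined set, 
that header is telling the classifier which source a message came from 
and probably needs to be normalized or dropped.  The same kind of check 
could be repeated for dates and for anonymization patterns.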

  Terri