[spambayes-dev] Re: spambayes-dev Digest, Vol 12, Issue 15

Fri Apr 16 23:52:53 EDT 2004

    >> One thing I think we need to be careful of is using test data sets
    >> whose messages are too old.  It's apparent the spammers are a moving
    >> target, so what worked one or six months ago (or perhaps even a week
    >> ago) may not work as well today.

    Thomas> .... A guy named Terry Sullivan (who knows a _lot_ more about
    Thomas> statistics than I do) analyzed [Thomas's data] and presented
    Thomas> some conclusions about spam volatility at the MIT conference
    Thomas> this past January. He composed a summary article about it here:

    Thomas> http://www.qaqd.com/research/spam-e1.htm

    Thomas> The upshot was spam changes a lot more slowly than common
    Thomas> thought suggests.

I'm not going to try to argue with statistics.  However, if I understand the
summary article, it appears that two features in the principal component
analysis account for 86% of the properties of your data set and that all the
other features were indistinguishable from noise.  I don't know how 86%
relates to how much spam those two features would reliably detect,
especially in the presence of ham, but my guess is that it's much less than
the 99+% we need to have an effective spam filtering solution.  Looking at
how Spambayes has classified my mail since mid-December, I see 168k spams
(~60%), 87k hams (~31%) and 27k unsures (~10%).  If Spambayes were only
identifying 86% of the spams (does the PCA number imply that?), that would
be another 23k spams I'd have had to look at.  In addition, PCA doesn't seem
like it begins to address the issue of false positives and false negatives.
Who cares if it identifies 86% of the spams if it also erroneously
classifies 1% (to pick a number out of thin air) of the hams as spams?
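
For what it's worth, the arithmetic behind those figures can be checked with
a few lines of Python.  This is only a sketch: the 86% detection rate and the
1% false positive rate are the hypothetical numbers from the paragraph above,
not measurements, and the variable names are made up for illustration.

```python
# Counts quoted above, in thousands of messages since mid-December.
spam_k, ham_k, unsure_k = 168, 87, 27
total_k = spam_k + ham_k + unsure_k


def share(count_k):
    """Return a count as a whole-number percentage of all classified mail."""
    return round(100 * count_k / total_k)


shares = {"spam": share(spam_k), "ham": share(ham_k), "unsure": share(unsure_k)}

# Assumption: a filter that catches only 86% of the spam (the PCA figure,
# read as a detection rate, which it may well not be).
ASSUMED_DETECTION_RATE = 0.86
missed_spam_k = spam_k * (1 - ASSUMED_DETECTION_RATE)

# Assumption: a 1% false positive rate, a number picked out of thin air.
ASSUMED_FALSE_POSITIVE_RATE = 0.01
misfiled_ham = ham_k * 1000 * ASSUMED_FALSE_POSITIVE_RATE

print(shares)
print(f"spam missed at 86% detection: about {missed_spam_k:.1f}k")
print(f"ham misfiled at a 1% false positive rate: about {misfiled_ham:.0f}")
```

Under those assumptions the missed spam comes to roughly 23.5k messages, and a
1% false positive rate would misfile on the order of 870 hams, which is the
number I'd actually worry about.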

It's clear that spammers try different things.  They have to move from one
mail host to another.  They have to cover their tracks by routing mail
through open relays.  They have to "infect" vulnerable machines to create
open relays for themselves.  They have to add hash busters.  They have to
disguise key words (like "v1@grA").  They have to gut their sales pitch and
just refer you to a URL.  They have to add word salad (both nonsense words
and real, but randomly chosen words).  They do this and lots of other stuff
to try and squeak as much spam past filters as they can.  I believe they
will continue to try other tricks.  One can hope that they are running out
of tricks to try, but I'm pessimistic.

Skip