Whitelist/verification spam filters

David Mertz, Ph.D. mertz at gnosis.cx
Wed Aug 28 05:59:05 CEST 2002

"Mark McEahern" <marklists at mceahern.com> wrote previously:
|If the whitelist system (such as TMDA) allows you to view the messages that
|have not been confirmed, then it's not excluding them, it's merely filtering
|them.  So, in that sense, it's better than nothing (unless you view the
|confirmation request as a distinctly negative thing).

If I need to review rejected messages before final deletion, then the
real work TMDA does for me is significantly less.  The truth is, I don't
trust *any* filtering system enough to forgo reviewing the rejects
occasionally.  But a "passive" system--whether based on regexen or on
statistical models (Bayes)--avoids the small extra burden that
whitelist/verification systems place on legitimate correspondents.

Moreover, there is a whole category of legitimate messages that I forgot
to mention in my previous enumeration.  I frequently "correspond" with
automated email robots.  For example, when I make a purchase online, I
often get some sort of confirmation email, and sometimes must respond to
this confirmation to complete the transaction.  I do this often enough
that adding each thing to a whitelist is a hassle--and even if I were to
try, I am rarely sure exactly what return address will appear on an
automated response (sometimes even the domain is different, e.g. someone
else handles payment billing).  And even if I took the trouble to
whitelist a robot, it seems slightly silly for a transaction with a
vendor I will never talk to again after today.

|Well, I may not be understanding you here.  But let's say you have a body of
|messages: M.  Take a subset of them up to a certain point in time-->M1.
|Take all the from addresses in M1 and add them to the whitelist.  Then, see
|how many messages in M-M1 are from addresses not harvested from M1.

There is a good idea in this.  But I think there are several problems
with actually reporting this as quantitative data.  An historical
comparison like this would require some programming, which is, of
course, some extra work.  But that's not unreasonable.  What I do not
know in
the suggested analysis is what percentage of legitimate correspondents
actually WOULD respond to the authentication challenge.  The methodology
suggested assumes they all would, but that's the very issue in question.
Moreover, even though I think it is unlikely, it is *possible* that some
spammers would respond to the authentication, thereby creating a
category of false negatives in the filtering.
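For what it's worth, the mechanical part of Mark's comparison (harvest
the From: addresses in M1, then count how many messages in M-M1 come
from addresses never seen in M1) is only a few lines of Python.  A
sketch, assuming the corpus has been split into two mbox files with
hypothetical names 'M1.mbox' and 'rest.mbox':

```python
import mailbox
from email.utils import parseaddr

def harvest_senders(mbox_path):
    """Collect the set of From: addresses appearing in an mbox file."""
    senders = set()
    for msg in mailbox.mbox(mbox_path):
        # parseaddr() splits "Real Name <addr>" into (name, addr)
        _, addr = parseaddr(msg.get('From', ''))
        if addr:
            senders.add(addr.lower())
    return senders

def count_unknown(mbox_path, whitelist):
    """Count messages whose sender does not appear on the whitelist.
    Messages with no usable From: header count as unknown."""
    unknown = 0
    for msg in mailbox.mbox(mbox_path):
        _, addr = parseaddr(msg.get('From', ''))
        if addr.lower() not in whitelist:
            unknown += 1
    return unknown

# Usage (hypothetical file names):
#   whitelist = harvest_senders('M1.mbox')
#   print(count_unknown('rest.mbox', whitelist))
```

Of course, this only counts how many later senders were *new*--it
cannot tell you how many of them would have answered a confirmation
challenge, which is the number actually in dispute.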

In my article, I intend to discuss whitelist/filtering in generalities,
including my expectations about false positives and false negatives.
But I believe that any specific quantification would be misleading in
this case.

Yours, David...

    _/_/_/ THIS MESSAGE WAS BROUGHT TO YOU BY: Postmodern Enterprises _/_/_/
   _/_/    ~~~~~~~~~~~~~~~~~~~~[mertz at gnosis.cx]~~~~~~~~~~~~~~~~~~~~~  _/_/
  _/_/  The opinions expressed here must be those of my employer...   _/_/
 _/_/_/_/_/_/_/_/_/_/ Surely you don't think that *I* believe them!  _/_/

More information about the Python-list mailing list