Whitelist/verification spam filters

David Mertz, Ph.D. mertz at gnosis.cx
Tue Aug 27 19:07:21 CEST 2002


"Mark McEahern" <marklists at mceahern.com> wrote previously:
|Well, and I don't mean this snidely, then perhaps the technique serves its
|purpose by filtering you out; i.e., since you don't care that much, perhaps
|what you had to say was not that important? ... I have a
|relatively limited set of people who I contact regularly.  It would be
|annoying if I had to reply each time--presumably, they could add me to a
|list so that I would only have to reply once.
|> Especially irritating is when someone emails _you_, and your
|> response needs to go through this level of filtering.

While the characterization as "evil" is just plain silly, I agree with
some of the criticism of this style of spam filtering.  Unlike McEahern,
I have a quite large set of people who contact me.  A lot of them are
not "regular", but are still quite legitimate--certainly at least
hundreds of such people in the last year, say (people write me about my
articles and my software, but perhaps only a few times close together
for a brief conversation).

A lot of my correspondents have flaky email systems, and might miss the
confirmation requests.  Many of them are non-native English speakers,
and might misunderstand the purpose of the automated response.  Even
more of them use multiple email addresses, and the automated response
might not go to the address(es) they want to write me from.  Some are
lazy, and some leave school or work around the time a confirmation
message arrives.  I am quite certain that using a whitelist/verification
system would wind up excluding a significant number of messages that I
would otherwise wish to receive.
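
To make the multiple-address problem concrete, here is a minimal sketch
of a whitelist/verification inbox.  The class name, its methods, and
the example addresses are all made up for illustration; no particular
product is being modeled, and a real system would send an actual
confirmation email rather than append to a list.

```python
class VerifyingInbox:
    """Toy whitelist/verification filter (hypothetical, for illustration)."""

    def __init__(self, whitelist=()):
        self.whitelist = set(addr.lower() for addr in whitelist)
        self.inbox = []        # delivered (sender, body) pairs
        self.pending = {}      # sender -> list of held message bodies
        self.challenges = []   # addresses a confirmation request went to

    def receive(self, sender, body):
        sender = sender.lower()
        if sender in self.whitelist:
            self.inbox.append((sender, body))
            return
        # Unknown sender: hold the message, challenge once per address.
        if sender not in self.pending:
            self.challenges.append(sender)
        self.pending.setdefault(sender, []).append(body)

    def confirm(self, sender):
        # The sender answered the challenge: whitelist and release mail.
        sender = sender.lower()
        self.whitelist.add(sender)
        for body in self.pending.pop(sender, []):
            self.inbox.append((sender, body))


box = VerifyingInbox(["friend@example.org"])
box.receive("reader@work.example", "Question about your article")
box.confirm("reader@work.example")
# Same person, different address: held again until confirmed again.
box.receive("reader@home.example", "Follow-up from home")
```

In this sketch the follow-up sits in `pending` until the second address
is confirmed too, which is exactly the kind of message that is easy to
lose.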

I am writing an article comparing spam filtering techniques for IBM
developerWorks, as it happens.  I will discuss a number of distinct
techniques, including the whitelist/verification approach.  Part of my
article is quantitative testing of false positive and false negative
categorization of large corpora I developed (i.e. selected from my email
archives).  I don't really know any way to include the
whitelist/verification approach in the quantitative data,
unfortunately--it can't be used against my saved collections of
messages, of course.  Even actually using it wouldn't really provide
good data (if I were to manually look through "to-be-verified" I would
pretty much have to whitelist the people who were legit, thereby
tainting the data).
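
For the techniques that can be run against a saved corpus, the two
error rates can be computed along the following lines.  This is only a
sketch: the function name is hypothetical, and it assumes the corpus
has been reduced to parallel lists of true labels and filter verdicts
(True meaning "spam").

```python
def error_rates(labels, predictions):
    """Return (false_positive_rate, false_negative_rate).

    labels and predictions are parallel sequences of booleans,
    where True means "spam".
    """
    pairs = list(zip(labels, predictions))
    n_ham = sum(1 for is_spam, _ in pairs if not is_spam)
    n_spam = sum(1 for is_spam, _ in pairs if is_spam)
    # False positive: legitimate mail that the filter called spam.
    fp = sum(1 for is_spam, said_spam in pairs if said_spam and not is_spam)
    # False negative: spam that the filter let through.
    fn = sum(1 for is_spam, said_spam in pairs if is_spam and not said_spam)
    fp_rate = fp / n_ham if n_ham else 0.0
    fn_rate = fn / n_spam if n_spam else 0.0
    return fp_rate, fn_rate
```

A whitelist/verification filter has no verdict to supply for an
archived message, since its decision depends on whether the sender
answers a challenge, so it cannot be plugged into a function like this.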

A sneak peek at my results:  I find that Bayesian filtering is much
better than a SpamAssassin "lots of regular expressions" approach or a
Razor/Pyzor "networked blacklist" approach.  But I also find--perhaps
surprisingly--that an analysis of trigrams does nearly as well as a
model based on words.  Actually, trigrams were better in my testing--though
I hand-tweaked parameters for my trigram analysis, while remaining fairly
simplistic about the word-Bayesian approach.
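
To illustrate the difference between the two models, here is a minimal
sketch of one naive Bayes scorer fed either words or trigrams.  It is
not the code or the parameters behind the results above: it assumes
"trigrams" means overlapping three-character sequences, uses plain
add-one smoothing, and trains on a tiny invented corpus.  All function
names are hypothetical.

```python
import math
from collections import Counter


def word_features(text):
    # Lowercased, whitespace-separated tokens.
    return text.lower().split()


def trigram_features(text):
    # Overlapping three-character substrings (assumed meaning of "trigram").
    t = text.lower()
    return [t[i:i + 3] for i in range(len(t) - 2)]


def train(messages, featurize):
    # messages: iterable of (text, is_spam) pairs.
    counts = {True: Counter(), False: Counter()}
    totals = {True: 0, False: 0}
    for text, is_spam in messages:
        counts[is_spam].update(featurize(text))
        totals[is_spam] += 1
    return counts, totals


def spam_log_odds(text, model, featurize):
    # Positive result leans spam, negative leans legitimate.
    counts, totals = model
    vocab = len(set(counts[True]) | set(counts[False]))
    n_spam = sum(counts[True].values())
    n_ham = sum(counts[False].values())
    score = math.log((totals[True] + 1) / (totals[False] + 1))
    for f in featurize(text):
        p_spam = (counts[True][f] + 1) / (n_spam + vocab)
        p_ham = (counts[False][f] + 1) / (n_ham + vocab)
        score += math.log(p_spam / p_ham)
    return score


# Invented toy corpus, only to show that the code runs.
TOY = [
    ("buy cheap pills now", True),
    ("cheap loans buy now", True),
    ("question about your python article", False),
    ("patch for your text processing software", False),
]
word_model = train(TOY, word_features)
trigram_model = train(TOY, trigram_features)
```

The only thing that changes between the two models is the `featurize`
argument; trigrams need no tokenizer and can still match on fragments
of misspelled or run-together words.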

Yours, David...

--
mertz@  | The specter of free information is haunting the `Net!  All the
gnosis  | powers of IP- and crypto-tyranny have entered into an unholy
.cx     | alliance...ideas have nothing to lose but their chains.  Unite
        | against "intellectual property" and anti-privacy regimes!
-------------------------------------------------------------------------




