Whitelist/verification spam filters

Paul Wright -$P-W$- at verence.demon.co.uk
Wed Aug 28 14:37:40 CEST 2002

In article <mailman.1030469441.30834.python-list at python.org>,
David Mertz, Ph.D. <mertz at gnosis.cx> wrote:
>While the characterization as "evil" is just plain silly, I agree with
>some of the criticism of this style of spam filtering.  Unlike
>McEahern, I have a quite large set of people who contact me.  A lot of
>them are not "regular", but are still quite legitimate--certainly at
>least hundreds of such people in the last year, say (people write me
>about my articles and my software, but perhaps only a few times close
>together for a brief conversation).

Indeed. One other thing which I've not seen mentioned yet is what
happens when two people using such systems email each other for the
first time. Unless each system whitelists everyone its user sends email
to, the confirmation request from the second person's system is caught
in the first person's filter, which sends a confirmation request of its
own. Well-designed software will avoid a mail loop at this point, but
the resulting deadlock requires human intervention to break.
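
To make the deadlock concrete, here is a toy simulation in Python. It is
only a sketch: the ChallengeFilter class, its whitelist_on_send flag and
the first_contact helper are names made up for illustration, not taken
from any real filtering package.

```python
class ChallengeFilter:
    """Toy whitelist/verification filter (hypothetical, for illustration)."""

    def __init__(self, owner, whitelist_on_send=True):
        self.owner = owner
        self.whitelist = set()
        self.held = []  # senders whose mail is waiting for confirmation
        self.whitelist_on_send = whitelist_on_send

    def send(self, recipient):
        # Whitelisting every address we write to lets the other side's
        # confirmation request through our own filter.
        if self.whitelist_on_send:
            self.whitelist.add(recipient)

    def receive(self, sender, is_challenge=False):
        """Return 'delivered', 'challenged' or 'held'."""
        if sender in self.whitelist:
            return "delivered"
        self.held.append(sender)
        # Never answer a challenge with another challenge (no mail loop),
        # but the message still sits in the held queue until a human acts.
        return "held" if is_challenge else "challenged"


def first_contact(a, b):
    """a writes to b for the first time; return what happens to b's reply."""
    a.send(b.owner)
    outcome = b.receive(a.owner)
    if outcome == "challenged":
        # b's filter sends a confirmation request back to a.
        return a.receive(b.owner, is_challenge=True)
    return outcome
```

With whitelist_on_send=False on the first filter, first_contact returns
"held": the confirmation request is stuck and nobody sees it. With the
flag on, it returns "delivered" and the exchange can complete.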

>I am writing an article comparing spam filtering techniques for IBM
>developerWorks, as it happens. 

Are you aware of the Distributed Checksum Clearinghouse (DCC)? That
seems to be a good way of dealing with spam, to my mind. Running on the
server, it counts the number of times similar messages have been seen
(by storing a hash of the body of the email message). Messages which
have been seen a large number of times are either spam or mailing lists.
Users need to whitelist mailing lists for that reason. Servers can flood
the counts of each hash between themselves to co-operate in filtering.
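
The counting idea can be sketched in a few lines of Python. This is not
DCC's actual protocol or checksum scheme, just an illustration of the
principle; body_checksum, ChecksumCounter, the threshold value and the
whitespace/case normalisation are all assumptions of mine.

```python
import hashlib
from collections import Counter


def body_checksum(body):
    # Collapse whitespace and case so trivial reformatting does not
    # change the hash (an assumed normalisation, not DCC's own).
    normalised = " ".join(body.lower().split())
    return hashlib.sha1(normalised.encode("utf-8")).hexdigest()


class ChecksumCounter:
    """Counts how often each message body has been seen (toy model)."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.counts = Counter()
        self.whitelisted_senders = set()  # e.g. mailing lists the user reads

    def report(self, body):
        """Record one sighting of this body and return the new count."""
        digest = body_checksum(body)
        self.counts[digest] += 1
        return self.counts[digest]

    def is_bulk(self, body, sender=None):
        # Bulk mail from a whitelisted sender (a mailing list) is wanted.
        if sender in self.whitelisted_senders:
            return False
        return self.counts[body_checksum(body)] >= self.threshold

    def merge_from(self, peer_counts):
        # Crude stand-in for flooding: add a peer's counts to our own.
        # A real system would have to avoid counting a report twice.
        self.counts.update(peer_counts)
```

A body reported three times is flagged by is_bulk unless its sender has
been added to whitelisted_senders, which is why users need to whitelist
their mailing lists.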

The only problem I can see with this idea is the "hash busters" (random
junk which makes every copy of a message hash differently) which
spammers include in their spam. DCC has some ad hoc ways of getting
around hash busters, but it uses cryptographic hash functions once it
has done the ad hoc stuff. A fuzzy digest function would be better. One
candidate is nilsimsa (see Google), which is based on trigrams. However,
the statistical properties of nilsimsa aren't amenable to useful
analysis at the moment. It works on the number of bits which are the
same in two digests, so you'd like all bits to have an equal
probability of matching. They don't, at least when I feed it about
10 000 email bodies from my archives.
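
For anyone who wants to experiment, here is a toy trigram digest in
Python together with the per-bit measurement described above. To be
clear, this is NOT nilsimsa: the bucket hash (first byte of MD5), the
mean threshold and all the function names are my own assumptions, chosen
only to show the shape of the idea.

```python
import hashlib
from itertools import combinations

BITS = 256


def trigram_digest(text):
    """Return a 256-bit digest as an int (toy stand-in, not real nilsimsa)."""
    counts = [0] * BITS
    for i in range(len(text) - 2):
        trigram = text[i:i + 3].encode("utf-8")
        bucket = hashlib.md5(trigram).digest()[0]  # bucket number 0..255
        counts[bucket] += 1
    threshold = sum(counts) / BITS  # mean bucket count
    digest = 0
    for bit, count in enumerate(counts):
        if count > threshold:
            digest |= 1 << bit
    return digest


def matching_bits(d1, d2):
    """Number of bit positions (out of 256) on which two digests agree."""
    return BITS - bin(d1 ^ d2).count("1")


def per_bit_agreement(digests):
    """For each bit position, the fraction of digest pairs which agree."""
    pairs = list(combinations(digests, 2))
    if not pairs:
        raise ValueError("need at least two digests")
    agree = [0] * BITS
    for d1, d2 in pairs:
        diff = d1 ^ d2
        for bit in range(BITS):
            if not (diff >> bit) & 1:
                agree[bit] += 1
    return [a / len(pairs) for a in agree]
```

Appending a short hash buster to a message changes only the few buckets
its trigrams land in, so matching_bits stays close to 256, whereas a
cryptographic hash would change completely. Running per_bit_agreement
over a large archive shows whether every bit is equally likely to match;
note that it compares all pairs, so for thousands of messages you would
want to sample pairs instead.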

I'd be interested to learn of any fuzzy digest functions people have
come across.

Paul Wright | http://pobox.com/~pw201 |
