Graham's spam filter

Paul Rubin phr-n2002b at
Fri Aug 23 07:06:46 CEST 2002

Roman Suzi <rnd at> writes:
> >The private database has to be separate for every user and protected
> >at least as well as the contents of the user's mailbox.  Otherwise the
> >spam filter becomes another Echelon or Carnivore, scanning private
> >user email for keywords and revealing them to third parties.
> Words could be hashed before being put into the private database.

I think that's not enough.  Let's say I want to know if you're
emailing somebody about artichokes, a fairly uncommon word.  I send
myself a few messages like "make nigerian money 3 inches longer
guaranteed" (so they will be classified as spam) but also containing
the word artichoke.  Now I send myself another message without the
spam keywords, but mentioning artichokes.

If you haven't been using the word artichoke in your previous email,
artichoke will now be flagged in the database as a spam word, so my
final artichoke message will get labelled as spam.  But if you HAVE
been emailing about artichokes, then "artichoke" will be in both the
spam and non-spam databases with similar probabilities, and my message
won't get flagged.  So a filter that shares one database among users
leaks info about the contents of your email.
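
To make the attack concrete, here is a rough Python sketch of it against a
toy shared filter.  Everything in it is an assumption for illustration: the
SharedFilter class, the 0.4 score for unseen words, the 0.01/0.99 clamps,
the 0.9 threshold and the sample messages are all made up, and the scoring
is only loosely in the style of Graham's filter, not his exact algorithm.
Words are hashed with SHA-256 before storage, which doesn't help, because
the attacker never reads the database and only watches how the probe
message gets classified.

```python
import hashlib
from collections import Counter


def hashed_tokens(text):
    # Hash every word before it touches the database. Hashing is
    # deterministic, so the same word always maps to the same key.
    return {hashlib.sha256(w.encode("utf-8")).hexdigest()
            for w in text.lower().split()}


class SharedFilter:
    """Toy filter with ONE database shared by all users (hypothetical)."""

    def __init__(self):
        self.counts = {True: Counter(), False: Counter()}  # keyed by is_spam
        self.messages = {True: 0, False: 0}

    def train(self, text, is_spam):
        self.counts[is_spam].update(hashed_tokens(text))
        self.messages[is_spam] += 1

    def token_prob(self, tok):
        s = self.counts[True][tok] / max(self.messages[True], 1)
        h = self.counts[False][tok] / max(self.messages[False], 1)
        if s + h == 0:
            return 0.4  # assumed score for a word never seen before
        return min(0.99, max(0.01, s / (s + h)))

    def is_spam(self, text, threshold=0.9):
        p_spam = p_ham = 1.0
        for tok in hashed_tokens(text):
            p = self.token_prob(tok)
            p_spam *= p
            p_ham *= 1.0 - p
        return p_spam / (p_spam + p_ham) > threshold


def probe_flagged(victim_mentions_artichoke):
    f = SharedFilter()
    for msg in ["meeting notes attached for review",
                "lunch on friday at noon",
                "please review the draft report"]:
        f.train(msg, is_spam=False)
    for msg in ["make nigerian money guaranteed",
                "money guaranteed 3 inches longer"]:
        f.train(msg, is_spam=True)
    if victim_mentions_artichoke:
        # The victim's private mail, which the attacker cannot see.
        f.train("the artichoke harvest report is attached", is_spam=False)
        f.train("artichoke prices for friday", is_spam=False)
    # Attacker mails himself spam that also contains the target word.
    for _ in range(3):
        f.train("make nigerian money 3 inches longer guaranteed artichoke",
                is_spam=True)
    # Probe: no spam keywords, just the target word plus unseen words.
    return f.is_spam("artichoke salad tonight")
```

With these toy counts, probe_flagged(False) returns True and
probe_flagged(True) returns False, so the spam/not-spam verdict on the
attacker's own probe tells him whether the victim has been writing about
artichokes.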

So you need a separate database for every user.  It might, however, be
ok to initialize each person's database from a bunch of published spam
messages and a bunch of published non-spam messages.
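
A minimal sketch of that arrangement, assuming a hypothetical TokenDB class
and PerUserStore helper (neither comes from any real filter): one seed
database is built from published messages, and each user gets a private
copy of it that only their own mail updates afterwards.

```python
import copy
from collections import Counter


class TokenDB:
    """Per-message word counts for spam and non-spam (hypothetical layout)."""

    def __init__(self):
        self.spam = Counter()
        self.ham = Counter()

    def train(self, text, is_spam):
        words = set(text.lower().split())
        (self.spam if is_spam else self.ham).update(words)


def build_seed(published_spam, published_ham):
    # Only published messages go into the seed, so it holds nothing private.
    seed = TokenDB()
    for msg in published_spam:
        seed.train(msg, is_spam=True)
    for msg in published_ham:
        seed.train(msg, is_spam=False)
    return seed


class PerUserStore:
    def __init__(self, seed):
        self._seed = seed
        self._dbs = {}

    def db_for(self, user):
        # Each user starts from a deep copy of the seed; later training
        # touches only that user's copy.
        if user not in self._dbs:
            self._dbs[user] = copy.deepcopy(self._seed)
        return self._dbs[user]


seed = build_seed(["make money guaranteed"], ["meeting notes attached"])
store = PerUserStore(seed)
store.db_for("alice").train("artichoke prices for friday", is_spam=False)
```

After this, Alice's database knows "artichoke" as a non-spam word, while
Bob's database and the seed have never seen it, so nothing Bob can observe
depends on Alice's mail.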
