[spambayes-dev] Whitelists (was: A spectacular false positive)

Sat Nov 15 11:30:22 EST 2003

[Tim]
> I have to note that it argues in favor of a whitelist
> gimmick too -- although that wouldn't have done me any good since I never
> would have anticipated that anything Jeremy sent would get scored as spam.
> Even if I had anticipated it, I don't remember all the email accounts he
> uses, and probably wouldn't have thought to whitelist the account he used to
> send this one.

I've been thinking about whitelists, and the more I think about them the
more I'm in favour of them.  We can do things with a built-in SpamBayes
whitelist that you just can't do with standard email client filters -
things that I think would address your objections, Tim.

All these rules would be optional, and possibly behind another rule that
says "An address must qualify N times before this happens":

 o Whenever a message is trained as ham, add the From address to the
   whitelist.

 o Whenever a message is trained as spam, remove the From address from the
   whitelist.

 o Whenever a message is received from a whitelisted address, and scores
   as solid (for some value of 'solid') ham, auto-train the message as
   ham.  You'd use this for personal acquaintances only, and not for
   mailing lists or organisations (amazon.com, ebay.com, etc.)
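
As a rough sketch of how those three rules and the "qualify N times"
gate might fit together (the class, method names and the 0.05 cutoff are
all made up for illustration, not existing SpamBayes code; the only thing
taken from SpamBayes is that scores run from 0.0 for ham to 1.0 for spam):

```python
from collections import Counter


class TrainingWhitelist:
    """Hypothetical whitelist that is maintained by training events."""

    def __init__(self, min_qualifications=1):
        # "An address must qualify N times before this happens."
        self.min_qualifications = min_qualifications
        self.ham_counts = Counter()
        self.addresses = set()

    def trained_as_ham(self, address):
        """Count a ham training; whitelist the sender once it qualifies."""
        address = address.lower()
        self.ham_counts[address] += 1
        if self.ham_counts[address] >= self.min_qualifications:
            self.addresses.add(address)

    def trained_as_spam(self, address):
        """A spam training removes the sender and resets its count."""
        address = address.lower()
        self.ham_counts.pop(address, None)
        self.addresses.discard(address)

    def should_auto_train(self, address, score, solid_ham_cutoff=0.05):
        """True if the sender is whitelisted and the score is 'solid' ham.

        The cutoff is an assumed value of 'solid', not a SpamBayes default.
        """
        return address.lower() in self.addresses and score <= solid_ham_cutoff
```

Resetting the count on a spam training is one possible choice; decrementing
it instead would be more forgiving of a single mistrained message.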

Add a couple of other features:

 o Give it an mbox file (or Outlook folder, etc.) and it adds all the
   addresses to the whitelist.

 o Support wildcard patterns in the whitelist, e.g. *@myemployer.com

and I think you have something that would be mostly automated.  You
wouldn't need to dig out all your acquaintances' addresses and add them by
hand, because the act of training would catch many of them.  The ability
to add all the addresses in a folder would catch most of the rest (for
anyone that keeps a good deal of old email around, which I suspect is most
people, especially in a working environment).
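
The mbox import and the wildcard matching could look something like the
sketch below.  It uses only the standard library (mailbox, fnmatch and
email.utils); the function names are hypothetical, and lower-casing
everything before comparing is my assumption about how matching should
work, not a decision anyone has made:

```python
import fnmatch
import mailbox
from email.utils import parseaddr


def addresses_from_mbox(path):
    """Collect the lower-cased From addresses of every message in an mbox."""
    found = set()
    box = mailbox.mbox(path, create=False)  # raise if the file is missing
    try:
        for message in box:
            # str() guards against Header objects for oddly encoded headers.
            _name, address = parseaddr(str(message.get("From", "")))
            if address:
                found.add(address.lower())
    finally:
        box.close()
    return found


def is_whitelisted(address, patterns):
    """True if the address matches any entry, e.g. '*@myemployer.com'."""
    address = address.lower()
    return any(fnmatch.fnmatchcase(address, p.lower()) for p in patterns)
```

Note that fnmatch also treats '?' and '[...]' as pattern characters, which
is probably harmless for email addresses but worth knowing about.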

The upshot: I still don't trust SpamBayes to delete my Spam without
looking at it.  This feature would mean I *would* trust it, because I could
be sure that when one of my friends or colleagues sends me a spammy
message (cf. the list of US state names I received a while ago) it doesn't
get classified as spam.  I'm prepared to take the risk of forged From
addresses because the time spent weeding out those will be far less than
the time I currently take glancing down my entire list of ~150 spams per
day.  I'm prepared to take the risk that the first ever email a friend
sends me gets deleted as spam (very unlikely).  I keep all my old mail,
sorted into ham and spam, so generating my whitelist will be easy (and
even if you don't keep all your old mail, generating a training-based
whitelist for frequent correspondents, or adding wildcard patterns for all
work addresses, would be easy).

Other features we'd need:

 o Manual editing in the web interface / an Outlook dialog - just a
   newline-separated list of addresses or wildcard patterns.

 o Import / export of whitelists as plain text files (choice of merge or
   replace on import)
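
Import and export are simple enough to sketch.  The function names and
the "merge"/"replace" mode strings are invented for illustration, and the
file format is just the newline-separated list described above:

```python
def export_whitelist(entries):
    """Return the whitelist as newline-separated text, sorted for stability."""
    return "\n".join(sorted(entries)) + "\n"


def import_whitelist(current, text, mode="merge"):
    """Return a new whitelist built from plain text.

    mode="merge" keeps the current entries and adds the imported ones;
    mode="replace" discards the current entries.
    """
    incoming = {line.strip().lower() for line in text.splitlines() if line.strip()}
    if mode == "replace":
        return incoming
    if mode == "merge":
        return set(current) | incoming
    raise ValueError("mode must be 'merge' or 'replace', not %r" % (mode,))
```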

Classification would just override whatever the classifier said, adding
"X-Spambayes-Classification: ham".  If you ask for evidence, you get
"X-Spambayes-Evidence: Whitelist rule '<rule>' matches 'From: <address>'".

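A minimal sketch of that override, using the standard email package.  The
function name and its arguments are hypothetical; only the two header
names and the evidence wording come from the description above:

```python
def apply_whitelist_override(msg, rule, address, include_evidence=False):
    """Force a ham classification on an email.message.Message.

    Any existing classification header is dropped first, so the whitelist
    verdict replaces whatever the classifier said.
    """
    del msg["X-Spambayes-Classification"]  # no error if the header is absent
    msg["X-Spambayes-Classification"] = "ham"
    if include_evidence:
        del msg["X-Spambayes-Evidence"]
        msg["X-Spambayes-Evidence"] = (
            "Whitelist rule '%s' matches 'From: %s'" % (rule, address)
        )
    return msg
```
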
Questions:

 o How to get the actual address from a To/From header - the address would
   need separating from the real name and any quoting.

 o Which headers to use?  Probably just From to keep it simple; maybe
   Reply-To as well.

 o Should there be a blacklist as well, for symmetry?  Probably not - a
   whitelist is far more useful.  A blacklist would only be useful if you
   were getting persistent false negatives from the same address despite
   repeated training - if that's happening then something's broken 8-)

 o Where to store the whitelist - it could get big, so bayescustomize.ini
   might not be the place.  Ongoing problems with DBRunRecovery errors put
   me off putting it in the clues database.
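
On the first two questions, the standard library's email.utils may already
do most of the work: getaddresses() separates the address from the real
name and quoting.  A sketch, assuming we look at From and Reply-To (the
function name and the default header list are mine):

```python
from email.utils import getaddresses


def header_addresses(msg, header_names=("From", "Reply-To")):
    """Return the lower-cased bare addresses found in the given headers.

    msg is an email.message.Message; real names and quoting are discarded.
    """
    values = []
    for name in header_names:
        # str() guards against Header objects for oddly encoded headers.
        values.extend(str(value) for value in msg.get_all(name, []))
    return [address.lower() for _name, address in getaddresses(values) if address]
```

Restricting header_names to ("From",) gives the keep-it-simple variant.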

-- 
Richie Hindle
richie at entrian.com