[Spambayes] An alternate use

Sat Nov 2 04:41:00 2002

A couple things have been kicking around in my head, and they've
managed to come together in an interesting configuration and stick,
so I'm going to make a quiet little proposal and see how much
thunder it generates.

First off, the observations:

1. Based on recent reports, spambayes works better when given full
   data about everything that comes through, not just the mistakes.
   This is predicted by the theory, too.

2. spambayes is extremely sensitive to changes in the nature of
   ham, and is moderately likely to classify any new topics/venues
   as spam.

3. spambayes is still a techie toy (though perhaps not for much
   longer).  People with a little know-how are going to have a
   much easier time training it than the average joe.

4. We want a large penetration into the mail-reading populace,
   to better force the spammers to change tactics.

5. Many people read mailing lists.  In fact, for high-volume
   mail users, mailing lists probably make up the majority of
   their incoming mail (or at least their incoming ham).

6. A noticeable amount of spam gets relayed through mailing lists,
   and most personal filters are notoriously prone to letting it
   through because it comes from a whitelisted intermediary.

7. Most mailing lists keep archives of everything sent over the
   list.

8. Most mailing lists are single-topic, and anything off-topic
   is unwanted.

So, what I propose is that we specifically target mailing list
managers (mailman and ecartis being the two obvious first
targets) for spambayes integration.  I see two main modes for
this: just adding headers for the less intrusive, and actually
rejecting or forcing moderation for the heavily policed.
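
To make the two modes concrete, here is a rough Python sketch of
how a list manager might turn a classifier score into an action.
Everything in it is an assumption for illustration: the Mode
names, the list_action function, and the 0.20/0.90 cutoffs are
made up here and are not part of spambayes, mailman, or ecartis.

```python
from enum import Enum


class Mode(Enum):
    TAG = "tag"          # less intrusive: only label the message
    ENFORCE = "enforce"  # heavily policed: reject or hold for moderation


# Hypothetical cutoffs on a 0.0 (ham) .. 1.0 (spam) score.
HAM_CUTOFF = 0.20
SPAM_CUTOFF = 0.90


def disposition(score):
    """Map a score to one of three verdicts."""
    if score >= SPAM_CUTOFF:
        return "Spam"
    if score > HAM_CUTOFF:
        return "Unsure"
    return "Ham"


def list_action(score, mode):
    """Return (action, verdict) for one incoming list message."""
    verdict = disposition(score)
    if mode is Mode.TAG or verdict == "Ham":
        # Tag-only lists always deliver; the verdict goes in a header.
        return ("deliver", verdict)
    if verdict == "Spam":
        return ("reject", verdict)
    # Unsure mail on a policed list waits for a human moderator.
    return ("moderate", verdict)
```

A list in TAG mode never blocks anything, so a bad verdict costs
nothing more than a misleading header.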

Training is easily accomplished by taking the list archives
as a ham corpus and one of the spam collections floating
around as a spam corpus.  Run the classifier over the training
data to kick out all the false positives and false negatives
for possible re-sorting, then retrain.  Only the list owner
has to be techie to do this, and list owners are more likely
to be techie than not (they set up a mailing list, after all).
Periodic retraining can be handled in the same way.
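
The "kick out the mistakes" step could look something like the
sketch below.  It assumes the trained classifier is available as
a plain callable that maps message text to a score between 0.0
and 1.0; the function name find_misfiled and the cutoff values
are hypothetical, not spambayes API.

```python
def find_misfiled(messages, score, ham_cutoff=0.20, spam_cutoff=0.90):
    """Return (false_positives, false_negatives) from labeled messages.

    messages: iterable of (text, is_spam) pairs, e.g. the list archive
              labeled as ham plus a spam collection labeled as spam.
    score:    callable taking message text and returning 0.0 .. 1.0
              (a stand-in for whatever the trained classifier exposes).
    """
    false_pos, false_neg = [], []
    for text, is_spam in messages:
        s = score(text)
        if not is_spam and s >= spam_cutoff:
            # Filed as ham but scores as spam: maybe spam that got
            # into the archive, worth a look before retraining.
            false_pos.append(text)
        elif is_spam and s <= ham_cutoff:
            # Filed as spam but scores as ham.
            false_neg.append(text)
    return false_pos, false_neg
```

The list owner would look over both lists, move anything that
was filed in the wrong corpus, and then retrain.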

In the case of adding headers, we'll want to avoid collisions
with personal use of spambayes, too.  I suggest tagging the
X-Spambayes-Disposition header (or whatever we call it) with
some identifier for which classifier generated the rating,
so that multiple X-Spambayes-Disposition lines are distinguishable.
Something like:

  X-Spambayes-Disposition: Spam by spambayes@python.org
  X-Spambayes-Disposition: Unsure by pennmush@pennmush.org

Personal classifiers could leave off the 'by' section.
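
As a sketch of the tagging, using Python's standard email
package: the header name and the "by" format follow the example
above, while the helper names tag_message and parse_disposition
are made up for illustration.

```python
from email import message_from_string

HEADER = "X-Spambayes-Disposition"


def tag_message(msg, verdict, classifier_id=None):
    """Append a disposition header; personal classifiers omit the id."""
    value = verdict if classifier_id is None else f"{verdict} by {classifier_id}"
    # Assigning to a Message appends a new header line rather than
    # replacing existing ones, so earlier ratings are preserved.
    msg[HEADER] = value
    return msg


def parse_disposition(value):
    """Split 'Spam by x@y' into ('Spam', 'x@y'); no 'by' gives None."""
    verdict, sep, who = value.partition(" by ")
    return verdict.strip(), (who.strip() if sep else None)
```

Reading the ratings back is then msg.get_all(HEADER), which
returns one string per header line in the order they were added.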

Heck, make it so that X-Spambayes-Disposition lines are turned
into words similar to the mailer lines, and then personal
classifiers can use the judgements of list classifiers as clues.
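
One way that could work, as a rough sketch: emit one token per
disposition line, combining the verdict and the classifier that
issued it.  The token format and the function name
disposition_tokens are assumptions for illustration; this is not
how the existing tokenizer is written.

```python
from email import message_from_string


def disposition_tokens(msg):
    """Yield one token per X-Spambayes-Disposition header line."""
    for value in msg.get_all("X-Spambayes-Disposition", []):
        verdict, sep, who = value.partition(" by ")
        # Lines without a 'by' part are treated as personal ratings.
        source = who.strip().lower() if sep else "personal"
        yield f"x-spambayes-disposition:{verdict.strip().lower()}:{source}"
```

Keeping the source in the token lets a personal classifier learn
separately how much to trust each list's rating.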

Doing this sort of integration into mailing list managers takes
advantage of some 'weaknesses' of spambayes, and could be of
great benefit to many people beyond just those with the
wherewithal to train and run the filter.

- Alex