[Spambayes] GBayes spam filtering

Paul Svensson paul-bayes@svensson.org
Thu, 5 Sep 2002 17:18:13 -0400 (EDT)


I've been following the discussion on spam filtering on the python-dev
list with great interest.  It looks very promising so far, but there's one
issue I would like to explore further: we don't all have a pre-filtered
corpus and a Tim or Brad to hand it to, to turn into a well-tuned filter,
and even if we did, how often would we need to bring them back to re-tune
the filter as the flavor of spam changes over time?

Thus my interest in the operational side of corpus-collecting.

This is more an issue for person-to-person email than for large mailing
lists, as the latter are more likely to actually have a Tim or Brad available.

I don't think it's realistic to expect users to mark everything they
read as ham or spam.  For a single-user setup, I would consider a
mail reader command "delete as spam"; everything that's read and not
thusly marked would go in the ham list.  However, for a multi-user system,
I think something a little more sophisticated would be necessary.

Here's my idea:

The message corpus database needs to contain, for each message,
	the message-id
	a timestamp (for removal of old stuff)
	the word count histogram
	a spam/ham flag

On SMTP receipt of a message, it's scanned, and if it smells like spam,
it's bounced.  It's NOT automatically added to the corpus.
If the message does not smell like spam, it's delivered, and
added to the corpus as ham.
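
The receipt step could look roughly like this.  This is a sketch under
assumptions: score stands in for whatever the filter provides (here any
callable returning a spam probability between 0 and 1), the 0.9
threshold is arbitrary, and the corpus is a plain dict keyed by
message-id:

```python
import time
from collections import Counter


def on_receipt(message_id, body, corpus, score, threshold=0.9, now=None):
    """Decide what to do with an incoming message.

    Returns "bounce" or "deliver".  Bounced messages are deliberately
    NOT added to the corpus; delivered ones are recorded as ham.
    """
    if score(body) >= threshold:
        return "bounce"
    corpus[message_id] = {
        "timestamp": time.time() if now is None else now,
        "histogram": Counter(body.lower().split()),  # placeholder tokenizer
        "is_spam": False,
    }
    return "deliver"
```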

When a user reads a message and finds that it's spam that got thru the filter,
they need a way to send the message-id to the corpus, to flag it as spam.
At this point, it would be a good idea to compare the histogram of the new
spam to each histogram in the ham corpus, and remove any that are similar
(any good ideas how to do the comparison?), or maybe if they are VERY
similar simply flag them as spam.  After recomputing the filter from the
modified corpus, we could also re-filter the ham corpus, and remove more
newfound spam that way.
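
One candidate for the comparison is cosine similarity between the two
word-count histograms.  That choice is an assumption, not something
tested here, and the two thresholds below are made up and would need
tuning against real mail.  The corpus is assumed to be a dict of
message-id -> {"histogram": ..., "is_spam": ...}:

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count histograms (0.0 to 1.0)."""
    dot = sum(count * b.get(word, 0) for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def report_spam(message_id, corpus, remove_at=0.6, flag_at=0.9):
    """Flag a delivered message as spam and clean up look-alike ham.

    Ham that is VERY similar (>= flag_at) is flagged as spam too;
    ham that is merely similar (>= remove_at) is dropped from the corpus.
    """
    reported = corpus[message_id]
    reported["is_spam"] = True
    for mid in list(corpus):          # copy the keys, since we delete below
        entry = corpus[mid]
        if mid == message_id or entry["is_spam"]:
            continue
        sim = cosine_similarity(reported["histogram"], entry["histogram"])
        if sim >= flag_at:
            entry["is_spam"] = True
        elif sim >= remove_at:
            del corpus[mid]
```

Comparing against every ham entry is linear in the corpus size per
report, which is probably acceptable if old entries are expired by
timestamp.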

Characteristic of this system, the spam corpus will be reasonably clean
(assuming the users don't abuse it too much), but the ham corpus will be
quite dirty, containing spam that's not yet read, and spam that the recipient
didn't bother to mark.  I'm curious how GBayes would handle this situation;
I assume the false negative rate would go up, but by how much?

	/Paul