[Spambayes] ageing out database entries

Fri Nov 14 14:09:13 EST 2003

[Kenny Pitt]
> In K9, the limits you're talking about only control how many complete
> messages of each type are stored in cache for future *re-training*.
> They do not affect the contents of the actual training database.  K9
> does not currently do any aging of the training data, although I believe
> it has been discussed in that context as well.

Today I just saw the following entry in the K9 configuration instructions:

"In addition to automatically cleaning up the Recent Emails list, you can
choose to clean out the Good and Spam folders of old emails when they reach
a certain size."

You can see the K9 configuration instructions at

http://keir.net/k9_configuration.html

It looks like K9 *does* age out old messages once the configured maximum
message count for spam or ham is reached and a new message is available for
training.  It appears that K9 trains on everything, and when the corpus is
full, it makes room for the new message by deleting and untraining the
oldest one.  This appears to be another reason why K9 stores the two
complete training corpora: with the messages on hand, it can untrain by
message age and does not have to timestamp every token, which is a big saving.
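
To make the scheme concrete, here is a minimal sketch of a fixed-size corpus
that untrains its oldest message when a new one arrives.  It is only an
illustration of the idea as I read it from the K9 instructions, not K9's or
SpamBayes' actual code: the class name AgingCorpus, its methods, and the
per-token message counts are all hypothetical, and a real classifier would
store more than a bare count per token.

```python
from collections import Counter, deque


class AgingCorpus:
    """Toy fixed-size training corpus (hypothetical, for illustration only)."""

    def __init__(self, max_messages):
        if max_messages < 1:
            raise ValueError("max_messages must be at least 1")
        self.max_messages = max_messages
        self.messages = deque()        # stored messages, oldest on the left
        self.token_counts = Counter()  # token -> number of stored messages containing it

    def train(self, tokens):
        """Train on one message; return the evicted message's tokens, or None."""
        tokens = frozenset(tokens)     # count each token once per message
        evicted = None
        if len(self.messages) >= self.max_messages:
            # Corpus is full: delete and untrain the oldest message first.
            evicted = self.messages.popleft()
            self._untrain(evicted)
        self.messages.append(tokens)
        self.token_counts.update(tokens)
        return evicted

    def _untrain(self, tokens):
        """Remove one message's contribution to the token counts."""
        self.token_counts.subtract(tokens)
        for tok in tokens:
            if self.token_counts[tok] <= 0:
                del self.token_counts[tok]  # token no longer seen in any message


# Usage: a spam corpus capped at two messages.
spam = AgingCorpus(max_messages=2)
spam.train(["viagra", "cheap"])
spam.train(["cheap", "mortgage"])
spam.train(["mortgage", "refinance"])  # evicts and untrains the first message
```

Because the complete message is kept, untraining needs no per-token
timestamp: the age lives in the message queue, and the token counts are
simply decremented when a message falls off the end.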

The advantages I see to this scheme are:

1) It allows you to set a predetermined desired number of messages for your
spam and ham training sets.  It appears from looking at the archives that
there were some "magic ratios" of spam/ham, or sweet spots, where SpamBayes
performed better for reasons that no one will probably ever understand.

2) Depending on the nature of your mail traffic, it will prevent the
spam/ham ratio from becoming pathologically skewed without any overt user
action.

3) It allows you to keep the tokens current with your email stream.

The disadvantages I see to this scheme are:

1) If you have infrequent correspondents who send messages very atypical of
your other ham, you would have to set the maximum number of ham messages
very high.  The discussion about the effects of large training sets aside,
this would mean a very large ham corpus file and a bigger than normal token
database.

2) If you have a highly asymmetrical spam/ham mail stream, keeping the spam
and ham training sets to a fixed size will prevent the training sets from
being contemporaneous.  I have no idea if this is actually a problem.

3) The ham message corpus file, in particular, could get very large
depending on whether SpamBayes does anything with file attachments.  If
SpamBayes does not need the complete file attachment to untrain a message,
this would not be an issue.
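
The arithmetic behind disadvantage 2 can be sketched as follows.  The
function name and the daily message rates are made up for illustration, and
the sketch assumes train-on-everything with a steady arrival rate.

```python
def corpus_span_days(max_messages, messages_per_day):
    """Days of mail covered by a full corpus capped at max_messages."""
    if messages_per_day <= 0:
        raise ValueError("messages_per_day must be positive")
    return max_messages / messages_per_day


# Hypothetical stream: 120 spam and 30 ham per day, both corpora capped at 600.
spam_span = corpus_span_days(600, 120)  # 5.0 days
ham_span = corpus_span_days(600, 30)    # 20.0 days
```

With equal caps, the ham corpus in this example reaches back four times as
far as the spam corpus, so the two training sets cover different periods.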

Whether or not any of this is worthwhile for SpamBayes hinges on questions
that I pose to those who have done experiments:

1) Is there any advantage in keeping the total training set size and
spam/ham ratio of the training set fixed?

2) Is it better to train on mistakes (the current system) or to train on
everything (except list traffic that Outlook rules file away before
SpamBayes sees it)?

---------------------------------

Why am I asking these questions?

[Tim Peters]
>>> I suggest you wait until you have a real problem before trying to
>>> solve it.
>
>What *bothers* you about SpamBayes?  What doesn't work right, or what was
>too hard to figure out, or what's still too confusing?  What's missing?

Well, what bothers me, so far, is that despite training on 620 ham and 1403
spam, SpamBayes still manages to miss (score as ham) 5-10 messages per day
out of around 150 scored messages.  Most of these missed spams have an
initial score very close to zero, so simply lowering the ham threshold would
not fix it.  After training as spam, their spam score often increases
respectably, but sometimes the score stays below 5%.  This suggests that
the same message would be missed next time as well.  I don't know whether I
just need a bigger or more balanced training set, whether there are some
types of tokens (such as embedded URLs in HTML spam) that are not currently
parsed, or whether this is just as good as it gets.  Anyway, that's what I
would like to see improved, and it is the motivation for the above discussion.

Any thoughts from those who've been there already?

--
Seth Goodman

  Humans:   please remove ".delete" to reply

  Spambots: please disregard the above