[Spambayes] Mass Distribution for Training Set

Anthony Baxter anthony at interlink.com.au
Mon May 24 23:37:10 EDT 2004

Bahman Lashgari wrote:
> Hello!
> We are considering providing this plug-in to the entire office. However, 
> it is an extra overhead of teaching people how to run training sets and 
> they may not have enough emails for the spam category to build a good 
> and updated set.  Our question is this: can we configure one training 
> file and load the same training file on all machines as default set? In 
> this case, for example, the training file would be training.file and we 
> could copy and paste to all workstations. How would this work? Your 
> input is very much appreciated. Thank you.

Bear in mind that individual preferences may vary as to what's spam
and ham - having said that, if you've got a "work email is for work"
policy, that should be less of a problem. Selecting the correct
training set will be a bit tricky - you want something that's
typical of everyone's email.

You may find it appropriate to make a couple of different training
databases if you have distinct groups of users with distinct types
of email. For example, a finance department would probably deal with
messages containing terms like 'credit cards', 'cheapest' and 'payment',
while an engineering team would not.
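
As a rough illustration of rolling out per-group defaults, here is a minimal
sketch in Python. The group names, the seed file names and the install_seed
helper are all hypothetical; only "training.file" comes from the question
above. Where your mail client actually keeps its SpamBayes database is not
assumed here, so you pass that directory in yourself.

```python
import shutil
from pathlib import Path

# Hypothetical mapping from user group to a pre-trained seed database file.
SEED_DATABASES = {
    "finance": "seed_finance.db",
    "engineering": "seed_engineering.db",
}
# Hypothetical fallback for users who belong to no listed group.
DEFAULT_SEED = "seed_general.db"


def install_seed(group, seed_dir, user_data_dir, db_name="training.file"):
    """Copy the seed database for `group` into a user's data directory.

    An existing database is never overwritten, so any training the user
    has already done is preserved. Returns True if a copy was made.
    """
    source = Path(seed_dir) / SEED_DATABASES.get(group, DEFAULT_SEED)
    target = Path(user_data_dir) / db_name
    if target.exists():
        return False
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(source, target)
    return True
```

Refusing to overwrite an existing file is a deliberate choice in this sketch:
the shared database is only a starting point, and each user's own training
should take over from there.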

I'd recommend quite a small initial training set - say about 30-40
of each (spam/ham). That way, if it _is_ sub-optimal for some users,
it won't be too hard for their training to overcome the default
training. As far as selecting the messages for the initial training
set - I'd start with an empty database, pick a couple of messages to
train on, then from your test set, train on the messages that are
furthest from being correctly scored - that is, pick the lowest
scoring spams and the highest scoring hams. Don't bother training on
messages that are already being scored perfectly (1.0/100% for a spam,
0.0/0% for a ham).
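
The selection procedure above can be sketched roughly as follows. This is an
illustration only: ToyScorer is a hypothetical stand-in (a smoothed
naive-Bayes style score), not the scoring SpamBayes actually uses, and the
names build_seed_set, per_class, seeds and margin are assumptions made for
the example. Since a toy scorer rarely hits exactly 1.0 or 0.0, the sketch
treats anything within `margin` of a perfect score as already good enough.

```python
import math
from collections import Counter


class ToyScorer:
    """Hypothetical stand-in classifier; returns a spam score in [0, 1]."""

    def __init__(self):
        self.spam_words, self.ham_words = Counter(), Counter()
        self.nspam = self.nham = 0

    def train(self, text, is_spam):
        words = set(text.lower().split())
        if is_spam:
            self.spam_words.update(words)
            self.nspam += 1
        else:
            self.ham_words.update(words)
            self.nham += 1

    def score(self, text):
        # Sum per-word log-odds with add-one smoothing; 0.5 when untrained.
        logodds = 0.0
        for w in set(text.lower().split()):
            p_spam = (self.spam_words[w] + 1) / (self.nspam + 2)
            p_ham = (self.ham_words[w] + 1) / (self.nham + 2)
            logodds += math.log(p_spam / p_ham)
        logodds = max(-50.0, min(50.0, logodds))  # avoid overflow in exp
        return 1.0 / (1.0 + math.exp(-logodds))


def build_seed_set(spams, hams, per_class=35, seeds=2, margin=0.01):
    """Pick a small training set by repeatedly training on the worst-scored
    messages: the lowest-scoring spam and the highest-scoring ham."""
    scorer = ToyScorer()
    spam_pool, ham_pool = list(spams), list(hams)
    chosen_spam, chosen_ham = [], []

    def take(pool, chosen, msg, is_spam):
        pool.remove(msg)
        chosen.append(msg)
        scorer.train(msg, is_spam)

    # Start from an empty database and train on a couple of messages.
    for _ in range(min(seeds, len(spam_pool))):
        take(spam_pool, chosen_spam, spam_pool[0], True)
    for _ in range(min(seeds, len(ham_pool))):
        take(ham_pool, chosen_ham, ham_pool[0], False)

    while True:
        progressed = False
        if spam_pool and len(chosen_spam) < per_class:
            worst = min(spam_pool, key=scorer.score)
            if scorer.score(worst) < 1.0 - margin:  # skip near-perfect spam
                take(spam_pool, chosen_spam, worst, True)
                progressed = True
        if ham_pool and len(chosen_ham) < per_class:
            worst = max(ham_pool, key=scorer.score)
            if scorer.score(worst) > margin:  # skip near-perfect ham
                take(ham_pool, chosen_ham, worst, False)
                progressed = True
        if not progressed:
            return chosen_spam, chosen_ham, scorer
```

The loop stops once each class reaches per_class messages or everything left
in the pools is already scored well, which keeps the default set small enough
for a user's own training to override it.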

Hope this helps!
Anthony Baxter     <anthony at interlink.com.au>
It's never too late to have a happy childhood.
