[Spambayes] Multi-User configuration

Thu May 26 01:30:28 CEST 2005

> I'm looking to set up SpamBayes for a small network of users,
> about 25, who are all using Outlook 2002. I would ideally like
> the spam database to be shared by everyone,

Is there a reason for that?  The very strength of filters like SpamBayes is
that they are trained to the individual.  Keeping individual databases would
also mean that installation was simple.

[...]
> It looks like I could set up each client to look at a shared
> drive on a file server, and then just make sure that their
> profile name was unique (which is all easy enough). It seems
> to be working ok for the two test machines I have sharing the
> database. I created the initial database from a large
> sample (about 400 spam and 400 good messages), and now each is
> able to add new spam or good messages to the filter as they go.

Basically, you're pointing both instances of the Outlook plug-in at the same
database file, correct?

[...]
> That's all fine, but I want to be sure if I will run into any
> trouble doing this on a larger scale, with all 25 users.

Yes, you will run into a lot of trouble.  You will find that the database
gets regularly corrupted (you'll find this with the two users, as well, but
it may take a bit longer).

SpamBayes doesn't have any support for concurrent access to the database,
which is what you are after here.  There are various ways you could achieve
this, but none are easy:

  1.  You could leave the plug-in running as it does by default, with
individual databases, and create a script that synchronises them at some
point (e.g. overnight) when they are not being used.

  2.  IIRC, some of the experimental database backends (mysql (1.0+),
postgresql (1.0+), ZODB/ZEO (1.1a1+)) can manage concurrent access.
However, you'd have to add additional code to handle this (with ZEO, at
least; I don't really know what the situation with the SQL ones is).
Basically, if concurrent access is attempted, it will raise an exception,
which you can catch, then wait and try again later.

  3.  You could leave the plug-in running as it does by default, with
individual databases, and have a script that creates a fresh database from
messages in certain folders (easy with spam, hard with ham) overnight, and
replace the individual databases with it.

  4.  You could do the filtering server-side somehow (e.g.
<http://spambayes.org/server_side.html>), although I'm not sure how you
would work in the bit about the users doing their own training.

  5.  I think (but am not 100% sure) that if you used 1.1a1 (or ran from
source) and used a pickle for storage, the database wouldn't get corrupted,
but you'd lose training data.  Each time the database was saved to disk, it
would become the valid copy, but the other instances of SpamBayes wouldn't
load that until they were restarted, so if one of those then saved the
database, the earlier information would be overwritten.
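
The "catch, wait, and try again" approach from option 2 could be sketched
roughly as below.  This is only an illustration of the pattern, not SpamBayes
code: ConcurrentAccessError and with_retry are hypothetical names, and the
exception a real backend raises on a conflict would need to be checked and
substituted.

```python
import time


class ConcurrentAccessError(Exception):
    """Hypothetical stand-in for whatever the storage backend raises."""


def with_retry(operation, attempts=5, delay=0.5, sleep=time.sleep):
    """Call operation(); if it hits a conflict, wait and try again.

    operation is any zero-argument callable (for example, a function that
    trains on one message and commits).  The sleep argument exists only so
    the waiting can be swapped out when testing.
    """
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except ConcurrentAccessError:
            if attempt == attempts:
                raise  # out of attempts: let the caller decide what to do
            sleep(delay * attempt)  # wait a little longer after each failure


if __name__ == "__main__":
    calls = {"n": 0}

    def flaky_train():
        # Pretend the first two attempts collide with another user.
        calls["n"] += 1
        if calls["n"] < 3:
            raise ConcurrentAccessError("database busy")
        return "trained"

    print(with_retry(flaky_train, delay=0.01))
```

Every place the plug-in writes to the database would need wrapping like this,
which is part of why this option is not easy.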

Essentially, unless you really do have good reason for wanting a shared
database (or have the resources to implement something like one of the
above), it would be best to leave the plug-in working with individual
databases.
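
If you did want to try the overnight synchronisation from option 1, the merge
step might look something like the sketch below.  It assumes each database
can be reduced to message totals plus per-token (spam, ham) counts held in a
plain dict; that layout and the merge_counts name are assumptions made for
illustration, not the actual SpamBayes storage API.

```python
def merge_counts(databases):
    """Sum message totals and per-token (spam, ham) counts.

    Each database is assumed (for this sketch) to look like:
        {"nspam": int, "nham": int, "tokens": {token: (spam, ham)}}
    """
    merged = {"nspam": 0, "nham": 0, "tokens": {}}
    for db in databases:
        merged["nspam"] += db["nspam"]
        merged["nham"] += db["nham"]
        for token, (spam, ham) in db["tokens"].items():
            old_spam, old_ham = merged["tokens"].get(token, (0, 0))
            merged["tokens"][token] = (old_spam + spam, old_ham + ham)
    return merged


if __name__ == "__main__":
    alice = {"nspam": 2, "nham": 1, "tokens": {"viagra": (2, 0), "hi": (0, 1)}}
    bob = {"nspam": 1, "nham": 3, "tokens": {"viagra": (1, 0), "lunch": (0, 2)}}
    print(merge_counts([alice, bob]))
```

Note that simple summing counts any training the databases have in common
more than once (such as a shared initial training set), so a real script
would have to track what each user added since the last merge.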

=Tony.Meyer

-- 
Please always include the list (spambayes at python.org) in your replies
(reply-all), and please don't send me personal mail about SpamBayes.
http://www.massey.ac.nz/~tameyer/writing/reply_all.html explains this.