[Spambayes] progress on POP+VM+ZODB deployment

Fri Oct 25 22:19:25 2002

I don't know if anyone else on Earth wants to manage their mail the
same way I do.  I've made some progress on hooking my mail up to
spambayes, however, and wanted to report on the deployment issues.

I read my mail with VM, an emacs mail reader.  My mail collects on a
couple of POP servers, and I fetch the mail directly from the POP
servers using VM.

I addressed the following issues:

- Incremental training from VM folders
- Scoring via a POP proxy
- Management of training data using ZODB

(I don't know if the last part was necessary or not, but I wanted to
use ZODB.  I think it's simplified some things.)

The runtime environment is fairly complicated.  It's got more moving
parts than I would like, but I don't know how to eliminate any of
them.  It's also slower than I would like, but I haven't done enough
profiling to really understand why.

There are a few open issues:

- It was hard to use the classifier module with ZODB because of the
  __slots__.  I ended up using the WordInfo objects unchanged, and
  __slots__ there helped minimize storage.  But I wanted to make the
  Bayes class persistent and I couldn't do that because of the slots.
  Since there's only a single Bayes instance, I can't see why it needs
  to use __slots__.

- I thought it would be nice if spambayes was a package, so I could
  separate it from my code.  It can't work as a package, though,
  because it contains a copy of the email package.  When I turned
  spambayes into a package, it ended up treating email as a
  subpackage.  My apps ended up getting two copies of the email
  package loaded -- one from the std library and one as a subpackage
  of spambayes.  The duplication broke a bunch of isinstance() tests.

- Configuration.  It would be nice to use the existing options
  framework and extend it with application-specific options (like the
  POP ports, the ZEO server location, etc.).  It isn't clear what the
  best way to extend Options is.
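
The isinstance() breakage from the duplicated email package can be
reproduced in miniature.  This is only a sketch in modern Python, not
spambayes code: the module names and the Message class are stand-ins,
and the source is executed twice to imitate one package being loaded
under two different names.

```python
import types

# Stand-in for the source of a module that exists in two places.
SOURCE = """
class Message:
    pass
"""


def load_copy(name):
    """Execute SOURCE in a fresh module object, as a second import path would."""
    module = types.ModuleType(name)
    exec(SOURCE, module.__dict__)
    return module


# Hypothetical names: one copy from the std library, one bundled as a subpackage.
std_email = load_copy("email")
bundled_email = load_copy("spambayes.email")

msg = bundled_email.Message()

# The class objects are distinct even though the source is identical.
same_copy = isinstance(msg, bundled_email.Message)
cross_copy = isinstance(msg, std_email.Message)
```

Here same_copy is True and cross_copy is False, which is the failure
mode: any code holding the other copy of the class rejects the object.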

The different components involved in the setup are:

- A ZEO server managing a ZODB database.

  I have a long-running ZEO server process.  By using ZEO, multiple
  clients can access the database at the same time.  Clients connect
  to the server using a Unix domain socket.

- A persistent mail profile based on VM folders.

  The profile is stored in the database.  A VM folder is just a Unix
  mailbox.  A config file contains a list of folders that contain ham
  and a list of folders that contain spam.  The profile manages these
  folders and a spambayes classifier.

- A training program, update.py.

  The training program scans the folders listed in the profile.  When
  it finds new messages, it learns from them.  When it finds that a
  message was deleted, it unlearns it.  This process is incremental,
  but it depends on the mailbox module to parse the folders.  The
  parsing is definitely slow -- especially for large folders.

- A POP3 proxy.

  I wrote my own proxy based on SocketServer.ThreadingTCPServer.  I
  don't like the asynchat style of programming, and I was having
  trouble integrating pop3proxy with ZEO.  They both use asyncore, but
  the way they use it seemed to be causing deadlocks :-(.

  The proxy uses the same strategy as pop3proxy, intercepting messages
  and adding a spam score header.  I add an X-Spambayes header like
  this:

     From: Martijn Pieters <mj@zope.com>
     To: <geeks@zope.com> (Zope.Com Geeks)
     Cc: sa@zope.com
     Subject: [Zope.Com Geeks] Zope.org storage server was down..
     Date: Fri, 25 Oct 2002 17:10:42 -0400
     X-Spambayes: 0.001

  The proxy doesn't do anything other than add the header.

- A set of VM filters and tools for handling spam and training.

  I wrote some little elisp functions.  One saves a message to the
  spam training folder and deletes it.  Another saves a message to
  the ham training folder, but does not delete it.  A third pipes it
  to a small Python script that prints out the evidence for a message.

  The next step is to add autofoldering rules that file spam above a
  certain threshold to the spam folder and messages in the middle to
  an unsure folder.  That's a standard VM thing, but I haven't done it
  yet.
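
The learn/unlearn cycle that update.py runs over each folder can be
sketched roughly as follows.  This is not the actual update.py: the
CountingClassifier class and the update_folder function are
hypothetical, and the sketch assumes the tokens of every trained
message are kept around so that a message can still be unlearned after
it has been deleted from the folder.

```python
class CountingClassifier:
    """Hypothetical stand-in for a classifier; it only counts tokens."""

    def __init__(self):
        self.counts = {}  # (token, is_spam) -> number of times learned

    def learn(self, tokens, is_spam):
        for token in tokens:
            key = (token, is_spam)
            self.counts[key] = self.counts.get(key, 0) + 1

    def unlearn(self, tokens, is_spam):
        for token in tokens:
            key = (token, is_spam)
            self.counts[key] -= 1
            if not self.counts[key]:
                del self.counts[key]


def update_folder(classifier, trained, current, is_spam):
    """Bring the classifier in line with the folder's current contents.

    trained maps message id -> tokens already learned (updated in place).
    current maps message id -> tokens of the messages now in the folder.
    """
    # New messages: learn them and remember their tokens.
    for msgid in set(current) - set(trained):
        classifier.learn(current[msgid], is_spam)
        trained[msgid] = current[msgid]
    # Deleted messages: unlearn them using the remembered tokens.
    for msgid in set(trained) - set(current):
        classifier.unlearn(trained.pop(msgid), is_spam)


clf = CountingClassifier()
trained = {}
update_folder(clf, trained, {"a": ["buy", "now"], "b": ["buy"]}, True)
update_folder(clf, trained, {"b": ["buy"]}, True)  # message "a" was deleted
```

Only the difference between the two scans touches the classifier, but
every scan still has to parse the whole folder to find that
difference, which is where the mailbox parsing cost comes in.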

The total code base is about 2000 lines of code, half of it in the POP
proxy.  I'd be happy to check it in to the spambayes project if anyone
else wants to try to use parts of it.

Jeremy