[spambayes-dev] SQL classifier

Skip Montanaro skip at pobox.com
Tue Jul 29 16:45:59 EDT 2003

I added a trivial PGClassifier class, which uses PostgreSQL as its database,
to spambayes/storage.py.  I haven't messed with it much yet, but a couple of
things come to mind:

    1. The hammie command line interface only allows two different save
       modes, dbm-style or pickle-style.  I faked it for the moment by
       changing DBDictClassifier to PGClassifier.

    2. Since real databases can do commit and rollback, it would be nice if
       this were exposed at the classifier level (maybe begin, commit and
       rollback methods).  Commits or rollbacks could be performed on a
       per-message basis, giving extra resilience in the face of errors.
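
As a rough illustration of item 2, here is a minimal sketch of a store with
begin, commit and rollback methods plus a per-message training helper.  It is
not the PGClassifier code: the TransactionalStore class, the counts table and
train_message are made up for the example, and sqlite3 stands in for
PostgreSQL so the snippet runs anywhere.  A PostgreSQL DB-API driver would
typically want %s placeholders instead of ?.

```python
import sqlite3


class TransactionalStore:
    """Hypothetical word-count store exposing begin/commit/rollback."""

    def __init__(self, conn):
        # conn is assumed to be a DB-API connection with autocommit off.
        self.conn = conn
        cur = conn.cursor()
        cur.execute("CREATE TABLE IF NOT EXISTS counts "
                    "(word TEXT PRIMARY KEY, nspam INTEGER, nham INTEGER)")
        conn.commit()

    def begin(self):
        # DB-API drivers start a transaction implicitly on the first
        # data-changing statement, so this is only a marker.
        pass

    def commit(self):
        self.conn.commit()

    def rollback(self):
        self.conn.rollback()

    def learn(self, words, is_spam):
        # The column name comes from a fixed pair, never from user input.
        col = "nspam" if is_spam else "nham"
        cur = self.conn.cursor()
        for word in words:
            cur.execute("SELECT 1 FROM counts WHERE word = ?", (word,))
            if cur.fetchone() is None:
                cur.execute("INSERT INTO counts (word, nspam, nham) "
                            "VALUES (?, 0, 0)", (word,))
            cur.execute("UPDATE counts SET %s = %s + 1 WHERE word = ?"
                        % (col, col), (word,))

    def count(self, word):
        cur = self.conn.cursor()
        cur.execute("SELECT nspam, nham FROM counts WHERE word = ?", (word,))
        row = cur.fetchone()
        return tuple(row) if row is not None else (0, 0)


def train_message(store, words, is_spam):
    """Train on one message; undo its partial updates if anything fails."""
    store.begin()
    try:
        store.learn(words, is_spam)
    except Exception:
        store.rollback()
        return False
    store.commit()
    return True
```

With this shape, a message that blows up halfway through tokenizing leaves
the counts exactly as they were before that message was started.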

Has anyone considered either of these issues?  I'd like to solve the first
one now so Spambayes can cleanly support more than just two types of
classifiers.  I think it will require a change to the command line args of
the various (Unix-based) tools.  They're getting pretty baroque as it is
now.  Maybe it's time for a cleanup (or maybe they should get more baroque
in the interests of backward-compatibility).

Right now with hammie you give database args of either

    -d -p FILE                  use dbm store found in FILE
    -D -p FILE                  use pickle store found in FILE

On the other hand, hammiefilter, pop3proxy and imapfilter use

    -d FILE                     use dbm store found in FILE
    -D FILE                     use pickle store found in FILE

It seems to me that we should probably change hammie's command line
interface to match the other Unix-style apps.  We could then add

    -S dsn

to specify that an SQL connection be made with the given dsn (data source
name, e.g., "host=localhost dbname=skip").  The dsn could be suitably
mangled to select appropriate databases, for instance:

    -S "PGSQL:host=localhost dbname=skip"

or

    -S "MYSQL:host=remote dbname=bayes"

(which would require some parsing to mash into the form MySQLdb likes).
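
The parsing involved could be fairly small.  The following is only a sketch
under assumptions: split_dsn, mysql_kwargs and parse_store_option are
hypothetical names, the -S option does not exist in any of the tools yet, and
values containing spaces are not handled.  It relies on MySQLdb.connect taking
keyword arguments with the database name spelled db, whereas a PostgreSQL
driver can usually be handed the "host=... dbname=..." string unchanged.

```python
import getopt


def split_dsn(dsn):
    """Split 'PGSQL:host=localhost dbname=skip' into ('PGSQL', {...})."""
    scheme, sep, rest = dsn.partition(":")
    if not sep or not scheme:
        raise ValueError("dsn needs a BACKEND: prefix: %r" % (dsn,))
    params = {}
    for item in rest.split():
        key, eq, value = item.partition("=")
        if not eq:
            raise ValueError("expected key=value, got %r" % (item,))
        params[key] = value
    return scheme.upper(), params


def mysql_kwargs(params):
    """Rename the libpq-style 'dbname' key to the 'db' keyword MySQLdb uses."""
    kwargs = dict(params)
    if "dbname" in kwargs:
        kwargs["db"] = kwargs.pop("dbname")
    return kwargs


def parse_store_option(argv):
    """Return (kind, target) for -d FILE, -D FILE or the proposed -S dsn."""
    opts, _args = getopt.getopt(argv, "d:D:S:")
    kinds = {"-d": "dbm", "-D": "pickle", "-S": "sql"}
    chosen = None
    for opt, value in opts:
        chosen = (kinds[opt], value)  # if several are given, the last wins
    return chosen
```

For example, parse_store_option(["-S", "MYSQL:host=remote dbname=bayes"])
yields the kind "sql" and the raw dsn, which split_dsn and mysql_kwargs then
turn into host="remote", db="bayes".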



P.S.  No, I haven't forgotten about my trim_email_parameters idea.  This
occurred to me as I was waiting for about 20 minutes for a database save on
a full retrain (obviously this wouldn't be a problem with an SQL database).
Maybe it's time for me to trim my training database.
