[spambayes-dev] RE: [Spambayes] How low can you go?

Skip Montanaro skip at pobox.com
Mon Dec 22 14:55:41 EST 2003


    Seth> I would like to investigate whole message expiration with
    Seth> different training and expiration schemes.  From our previous
    Seth> discussion, it seems that the most flexible way to approach this
    Seth> is by going to a system with several bidirectional maps
    Seth> implemented in the databases: feature_id <-> token, msg_id (+
    Seth> training timestamp) <-> feature_id and token database w/training
    Seth> timestamp per entry.  Instead of training timestamp, expiration
    Seth> time might be preferable.

I'll just toss out a thought with nothing really to back it up besides my
seat-of-the-pants experience.  You might find it easier to experiment with
different table layouts using SQL.  There are both MySQL and PostgreSQL
classifiers available (browse spambayes/storage.py).  You could add new
tables or new columns to existing tables without much fuss.  Also, hapax
expiration would be pretty simple.  (Add a last_used column, arrange for it
to get updated whenever a row is fetched - fairly trivial with
PostgreSQL's triggers I think, then use it to expire hapaxes periodically.)
Finally, problems of multi-thread or multi-process access to the database
should go away.
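
To make the hapax idea concrete, here is a rough sketch, not tested against
the real classifiers.  It uses Python's sqlite3 module so it runs standalone;
the table name (bayes), the column names (word, nspam, nham, last_used) and
all four helper functions are made up for illustration and are not claimed
to match what spambayes/storage.py uses.  One caveat: as far as I know
PostgreSQL triggers fire on INSERT/UPDATE/DELETE rather than on SELECT, so
this sketch stamps last_used from the application's fetch path instead of
from a trigger.

```python
import sqlite3


def make_db():
    """Create an in-memory token table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE bayes ("
        " word TEXT PRIMARY KEY,"
        " nspam INTEGER NOT NULL,"
        " nham INTEGER NOT NULL)"
    )
    # Adding a column to an existing table is a one-liner in SQL.
    conn.execute("ALTER TABLE bayes ADD COLUMN last_used REAL NOT NULL DEFAULT 0")
    return conn


def train(conn, word, is_spam, now):
    """Count one occurrence of word and stamp the row with the current time."""
    conn.execute(
        "INSERT OR IGNORE INTO bayes (word, nspam, nham, last_used)"
        " VALUES (?, 0, 0, ?)",
        (word, now),
    )
    # The column name comes from a fixed pair, never from user input.
    column = "nspam" if is_spam else "nham"
    conn.execute(
        f"UPDATE bayes SET {column} = {column} + 1, last_used = ? WHERE word = ?",
        (now, word),
    )


def fetch(conn, word, now):
    """Return (nspam, nham) for word, or None, refreshing last_used."""
    conn.execute("UPDATE bayes SET last_used = ? WHERE word = ?", (now, word))
    return conn.execute(
        "SELECT nspam, nham FROM bayes WHERE word = ?", (word,)
    ).fetchone()


def expire_hapaxes(conn, cutoff):
    """Delete tokens seen exactly once and not used since cutoff."""
    cur = conn.execute(
        "DELETE FROM bayes WHERE nspam + nham = 1 AND last_used < ?",
        (cutoff,),
    )
    return cur.rowcount
```

A periodic job would then call something like expire_hapaxes(conn,
time.time() - 30 * 86400); the 30-day window is an arbitrary choice for the
example, not a recommendation.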

Skip


