[Spambayes] Does anyone care about this report?

Skip Montanaro skip at pobox.com
Wed May 14 11:53:40 EDT 2003


    >> That's what I do as well, for filtering. But filtering is a read-only
    >> process with respect to hammie.db, right?

    Alex> I do both filtering and training from procmail, so answering the
    Alex> pure-filtering question is rather moot for me.  However, I _think_
    Alex> that filtering is currently a read-only process.  

Yes, I believe the database is opened read-only for classification.

    >> As for training, which obviously must modify hammie.db, I think it's
    >> safe to assume that there will only be one training process going on
    >> at a time (manually, via cron, etc.).

    Alex> I cannot make this assumption, since I train both through procmail
    Alex> and from cron.  I could simplify my situation slightly by having
    Alex> my cron job send me a mail which got picked up by procmail to do
    Alex> the full retrain (and then use procmail's locking for all access
    Alex> to the db), but that's a bit more convoluted than I want to deal
    Alex> with.

How about wrapping your cron training in a shell or Python script which uses
the same sort of file locking as procmail?  That way they'd play nice
together.  Another option is to train into a separate database.  (See
below.)
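
A minimal sketch of such a wrapper, assuming the cron job and the procmail
recipe agree on one lock file name (the name "hammie.db.lock", the function
dot_lock and its parameters are all hypothetical).  It creates the lock file
atomically with O_EXCL, in the spirit of procmail's local lockfiles; whether
it interoperates with a particular procmail setup, or behaves well on NFS,
is not something this sketch establishes.

```python
import contextlib
import os
import time


@contextlib.contextmanager
def dot_lock(lock_path, timeout=60.0, poll=0.5):
    """Hold an exclusive lock file for the duration of the with-block."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # O_EXCL makes creation fail if the file already exists,
            # so only one process can win the race to create it.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
            break
        except FileExistsError:
            if time.monotonic() >= deadline:
                raise TimeoutError("could not acquire %s" % lock_path)
            time.sleep(poll)
    try:
        # Record our pid so a human can see who holds the lock.
        os.write(fd, str(os.getpid()).encode("ascii"))
        os.close(fd)
        yield
    finally:
        os.remove(lock_path)
```

The cron script would then run its training step inside
"with dot_lock('hammie.db.lock'):".  A lock left behind by a crashed process
is not cleaned up here and would have to be removed by hand.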

    >> The real question is: What happens if email comes in while training
    >> is in progress? That's the exact question at the end of David
    >> Abrahams' document, but it seemed like a different question was
    >> getting answered.

    Alex> I thought that I had answered it with "I don't know, it depends on
    Alex> the precise implementation of the db".  Without knowing the
    Alex> internals of the db implementation, I cannot say if a read would
    Alex> fail if a write was in progress at the same time.  

Agreed.  In fact, without getting a little messy, Spambayes doesn't really
know "what lies beneath" anydbm.  It could figure that out using whichdb,
then attempt to do the right thing for each of the different kinds of
databases, but there's no guarantee *anything* can be done.  Does anyone
know what the file locking properties of dumbdbm or dbm are?  What about
gdbm or dbm via the berkeley db package?  All are available through Python,
and thus susceptible to use by anydbm.  I think the ultimate solution has
got to come from higher up (e.g., implement the same file locking that
procmail uses).
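
For illustration, this is roughly what asking whichdb looks like.  The
sketch uses the Python 3 names (anydbm became the dbm package and whichdb
became dbm.whichdb); describe_backend is a hypothetical helper, not
anything in Spambayes.  It only identifies the backend and says nothing
about how that backend locks its files.

```python
import dbm


def describe_backend(path):
    """Report which dbm flavour appears to be behind `path`."""
    kind = dbm.whichdb(path)
    if kind is None:
        # whichdb could not open the file at all.
        return "missing or unreadable"
    if kind == "":
        # The file exists but its format was not recognized.
        return "unrecognized format"
    # Otherwise whichdb returns a module name such as "dbm.gnu",
    # "dbm.ndbm" or "dbm.dumb".
    return kind
```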

That said, we could have Spambayes implement its own file locking scheme
which would (hopefully) work transparently on all platforms.  That would
sidestep the question of how each underlying database library locks its own
files.  Most, if not all, applications know soon after startup whether they
will need read/write access or just read access to their database files.
They should be able to create the appropriate kind of lock file, which other
Spambayes applications would honor.
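
One way such a scheme might look, as a sketch only: a writer takes an
exclusive "<db>.wlock" file, while each reader drops its own
"<db>.rlock.<pid>.<random>" marker.  The file names, read_lock, write_lock
and their parameters are all assumptions made for illustration.  Stale
locks left by a crashed process are not handled, and the scheme relies on
the filesystem making file creation visible to other processes promptly.

```python
import contextlib
import glob
import os
import time
import uuid


def _pause(deadline, poll, what):
    """Sleep briefly, or give up once the deadline has passed."""
    if time.monotonic() >= deadline:
        raise TimeoutError("timed out waiting for " + what)
    time.sleep(poll)


@contextlib.contextmanager
def write_lock(db_path, timeout=60.0, poll=0.2):
    """Exclusive access: one writer, no readers."""
    wlock = db_path + ".wlock"
    deadline = time.monotonic() + timeout
    while True:
        try:
            os.close(os.open(wlock, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644))
            break
        except FileExistsError:
            _pause(deadline, poll, "another writer")
    try:
        # Readers that started before our lock appeared may still be
        # active; wait for their markers to disappear.
        while glob.glob(glob.escape(db_path) + ".rlock.*"):
            _pause(deadline, poll, "readers")
        yield
    finally:
        os.remove(wlock)


@contextlib.contextmanager
def read_lock(db_path, timeout=60.0, poll=0.2):
    """Shared access: many readers, but never alongside a writer."""
    wlock = db_path + ".wlock"
    marker = "%s.rlock.%d.%s" % (db_path, os.getpid(), uuid.uuid4().hex)
    deadline = time.monotonic() + timeout
    while True:
        # Announce ourselves first, then look for a writer.  The writer
        # does the opposite, so at least one side notices the other.
        os.close(os.open(marker, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644))
        if not os.path.exists(wlock):
            break
        os.remove(marker)
        _pause(deadline, poll, "a writer")
    try:
        yield
    finally:
        os.remove(marker)
```

A classifier would wrap its database access in "with read_lock(path):" and
a trainer in "with write_lock(path):", each deciding which one it needs
soon after startup.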

    Alex> On the other hand, those of us who keep all their mail for
    Alex> retraining anyway don't care all that much if the db gets
    Alex> corrupted; we can just rebuild the db in case of error and move
    Alex> on.

I retrain to a different file, then rename it or copy it into place.  I've
never had a problem (which is not to say I won't someday).  Clearly
retraining into the same file from which classification is done opens up a
much bigger window of opportunity for gremlins to sneak in.
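
The retrain-elsewhere-then-rename approach can be sketched as follows.
rebuild_then_swap and the build callback are hypothetical names; the sketch
assumes the database is a single file (some dbm backends write several) and
that build overwrites the empty placeholder file it is handed.  Note that
the swapped-in file keeps the restrictive permissions mkstemp gives it.

```python
import os
import tempfile


def rebuild_then_swap(final_path, build):
    """Call build(tmp_path) to write a fresh file, then rename it into place."""
    directory = os.path.dirname(os.path.abspath(final_path))
    # Same directory means same filesystem, so the final step is a
    # rename rather than a copy.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".retrain-")
    os.close(fd)
    try:
        build(tmp_path)
        # os.replace overwrites the destination in one step; on POSIX
        # readers see either the old file or the new one, never a mix.
        os.replace(tmp_path, final_path)
    except BaseException:
        # Leave the existing database untouched and clean up the temp file.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```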

David, are we making any progress on your question?  I think you'd have
gotten a bit quicker resolution on it had it not been hidden at the end of
your IMAP filter document.  I saw "IMAP" and hit the 'd' key.  I suspect
others did as well.

Skip


