[Python-Dev] RE: [spambayes-dev] improving dumbdbm's survival chances...

Tim Peters tim.one@comcast.net
Sun, 13 Jul 2003 15:16:56 -0400

> I realize we (the Spambayes folks) want to discourage people from
> using dumbdbm, but for those who are either stuck with it or don't
> realize they are using it, I wonder if we can do a little something
> to help them out.
> As I understand it, if a machine crashes or is shut down without
> exiting Outlook, there's a good chance that the dumbdbm's _commit
> method won't have been called and the directory and data files will
> be out-of-sync.

This is so.  Worse, spambayes never calls close() on its Shelf object, so it
implicitly relies on dumbdbm.__del__ to rewrite the dir file.  But
dumbdbm.__del__ can easily trigger a shutdown race in dumbdbm._commit
(which references the global "_os" after shutdown cleanup has already
rebound it to None), and in that case the .dir and .dat files on disk
remain inconsistent.  (I fixed this race for 2.3 final, BTW.)
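
To make the close() point concrete, here is a minimal sketch of closing the
Shelf explicitly instead of leaving the work to __del__.  It assumes a modern
Python 3, where dumbdbm lives on as dbm.dumb; store_and_close() and the
temporary path are hypothetical names for illustration, not spambayes code.

```python
import dbm.dumb  # assumption: Python 3 name for the old dumbdbm module
import os
import shelve
import tempfile


def store_and_close(path, items):
    """Write items to a dumbdbm-backed Shelf and close it explicitly."""
    shelf = shelve.Shelf(dbm.dumb.open(path, "c"))
    try:
        for key, value in items.items():
            shelf[key] = value
    finally:
        # close() rewrites the directory file right now, so nothing is
        # left for __del__ to do during interpreter shutdown.
        shelf.close()


tmpdir = tempfile.mkdtemp()
db_path = os.path.join(tmpdir, "demo_db")
store_and_close(db_path, {"spam": 3, "ham": 7})

# Reopen the database to check that the directory and data files agree.
reopened = shelve.Shelf(dbm.dumb.open(db_path, "r"))
restored = dict(reopened)
reopened.close()
```

The try/finally only guards against exceptions; it does nothing for a
machine crash, which is where a periodic sync() comes in.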

> It seems that dumbdbm doesn't support a sync() method which shelve
> likes to call. Shelve's sync method gets called from time-to-time by
> the Spambayes storage code.  dumbdbm.sync has this statement:

No, you're quoting shelve.py here:

>     if hasattr(self.dict, 'sync'):
>         self.dict.sync()
>
> so maybe it's as simple (short-term) as modifying
> dbmstorage.open_dumbdbm() to
>
>     def open_dumbdbm(*args):
>         """Open a dumbdbm database."""
>         import dumbdbm
>         db = dumbdbm.open(*args)
>         if not hasattr(db, "sync"):
>             db.sync = db._commit
>         return db

That would help spambayes a lot, because DBDictClassifier.store() does call
self.db.sync() on its Shelf at the important times.  It wouldn't stop the
shutdown race in dumbdbm._commit() from bombing out with an exception, but
for spambayes that would no longer matter to on-disk database integrity.
People using dumbdbm with spambayes would still be a lot better off using a
plain in-memory dict, though (on all counts:  it would consume less memory,
consume less disk space for the dict pickle, and run faster).
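
A rough sketch of the plain-dict alternative, for comparison: keep everything
in an ordinary dict and pickle the whole thing at the important times.  The
helper names save_dict() and load_dict() are hypothetical, and writing to a
temporary file before renaming is my own assumption about how to avoid a
half-written pickle, not something spambayes is known to do.  It assumes
Python 3 (os.replace does not exist in 2.x).

```python
import os
import pickle
import tempfile


def save_dict(path, data):
    """Pickle the whole dict to path, replacing any previous copy."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as f:
        pickle.dump(data, f, protocol=2)  # binary pickle
    # Assumption: rename over the old file only after the write succeeded.
    os.replace(tmp_path, path)


def load_dict(path):
    """Return the pickled dict, or an empty dict if no file exists yet."""
    if not os.path.exists(path):
        return {}
    with open(path, "rb") as f:
        return pickle.load(f)


pickle_path = os.path.join(tempfile.mkdtemp(), "tokens.pck")
counts = load_dict(pickle_path)      # empty on the first run
counts["viagra"] = (12, 0)
save_dict(pickle_path, counts)
reloaded = load_dict(pickle_path)
```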

> The above should help.  Meanwhile, it appears that would be a good
> method to add to dumbdbm databases both for 2.3 and the 2.2
> maintenance branch.

Fine by me, although I doubt a 2.2.4 will get released.  Adding

    sync = _commit

to the 2.3 code (+ docs + test) should be sufficient.
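
To show how the one-line alias plays with the hasattr() test quoted from
shelve.py above, here is a toy sketch.  ToyDatabase is a hypothetical
stand-in, not the real dumbdbm class; only the "sync = _commit" line mirrors
the proposed change.

```python
class ToyDatabase:
    """Hypothetical stand-in for a dumbdbm-style database object."""

    def __init__(self):
        self.index = {}      # in-memory directory
        self.on_disk = {}    # pretend copy of the directory file
        self.commits = 0

    def __setitem__(self, key, value):
        self.index[key] = value

    def _commit(self):
        # Pretend to rewrite the directory file from the in-memory index.
        self.on_disk = dict(self.index)
        self.commits += 1

    # Class-level alias: every instance now has a public sync() method.
    sync = _commit


db = ToyDatabase()
db["spam"] = 1
# The same duck-typing test shelve uses before calling sync().
if hasattr(db, "sync"):
    db.sync()
```

One thing to keep in mind: the alias is bound when the class body runs, so a
subclass that overrides _commit would need to rebind sync as well.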

BTW, this code in the spambayes storage.py is revolting (having one module
change the documented default behavior of another module is almost always
indefensible -- I can't see any reason for this abuse in spambayes):

# Make shelve use binary pickles by default.
oldShelvePickler = shelve.Pickler
def binaryDefaultPickler(f, binary=1):
    return oldShelvePickler(f, binary)
shelve.Pickler = binaryDefaultPickler
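
A less invasive way to get binary pickles, sketched under the assumption that
the Shelf constructor accepts a protocol argument (it does as of 2.3, and
still does in Python 3, which this sketch targets via dbm.dumb): ask for the
protocol where the Shelf is created and leave the shelve module alone.
open_binary_shelf() is a hypothetical helper name.

```python
import dbm.dumb  # assumption: Python 3 name for the old dumbdbm module
import os
import shelve
import tempfile


def open_binary_shelf(path):
    """Open a dumbdbm-backed Shelf that writes binary (protocol 2) pickles."""
    # The protocol is requested per Shelf; shelve.Pickler is not touched,
    # so other users of shelve keep the documented default behavior.
    return shelve.Shelf(dbm.dumb.open(path, "c"), protocol=2)


shelf_path = os.path.join(tempfile.mkdtemp(), "binary_db")
shelf = open_binary_shelf(shelf_path)
shelf["wordinfo"] = (5, 2)
shelf.sync()
value = shelf["wordinfo"]
shelf.close()
```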