[spambayes-dev] Re: [Spambayes] fatal error?

Tim Peters tim.one at comcast.net
Tue Aug 26 22:58:07 EDT 2003


[Skip]
> I suspect that the Outlook plugin simply makes it easier to find
> problems (more users, more worm mail, more concurrent threads,
> whatever).

Is that relevant?  I've never seen a database corruption complaint from
someone using the Outlook addin (did I miss one?), and I deliberately
switched my 3 classifiers to Berkeley in order to try to provoke one.  No
luck.  IIRC, Mark has never seen this either.

The first message in this thread:

http://mail.python.org/pipermail/spambayes-dev/2003-August/000873.html

was copied to spambayes-dev from some other source, and was missing
sufficient context to tell what it was talking about.  Trying to track the
source down probably leads to here:

http://mail.python.org/pipermail/spambayes/2003-August/007311.html

If so, the OP was running on Windows, but was almost certainly not using the
Outlook addin:

   Now I'm getting an error message in the email my
   headers: X-Spambayes-Exception: bsddb._DBRunRecoveryError
   ((-30982, 'DB_RUNRECOVERY: Fatal error, run database recovery --
   fatal region error detected; run recovery')) in __getitem__() at
   C:\PTYTHON23\lib\bsddb\__init.py line 86: return self.db[key]

The Outlook addin never inserts email headers, so I don't believe that
fellow's problem had anything to do with the addin.

> I think the same (or a similar) problem would exist were two
> instances of hammiefilter running at the same time, both trying
> to update the file.  I'm just fortunate enough to have never
> encountered that problem.  Even using a pickle, you really ought to
> use some sort of lock protocol when reading or writing the pickle
> file if there's any chance of concurrent access by another process or
> thread.  That you only read it at the beginning and write it at the
> end only limits the opportunity for collision.

Python dicts are safe for multiple-reader single-writer access without
explicit synchronization, and per-access locks are so bloody expensive that
I don't want to change anything in the absence of proof that there's a
problem that can't be wormed around more cheaply.  To date, I don't believe
we've seen any report of corruption via the Outlook addin, which suggests
it's doing something right <wink>.
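
For the pickle case specifically, one cheap way around per-access locks is to never let a reader see a half-written file: write the new pickle to a temporary file in the same directory, then rename it over the old one. The following is a minimal sketch under that assumption, not what spambayes actually does. The names `save_pickle_atomically` and `load_pickle` are hypothetical, and it relies on `os.replace`, which exists only in modern Python (3.3+), not the Python 2.3 of this thread.

```python
import os
import pickle
import tempfile


def save_pickle_atomically(obj, path):
    """Write obj to path so that readers never see a partial file.

    Hypothetical helper: the data goes to a temp file in the same
    directory, which is then renamed over the target in one step.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)
            f.flush()
            os.fsync(f.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_path, path)  # swap the new file in under the old name
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_pickle(path):
    """Read the whole pickle; it is either the old or the new version."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

This only protects readers from torn files. If two processes both load, update, and save, the last writer still wins and the other's updates are lost, so it is no substitute for a real lock when concurrent writers are expected.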

> I just (re)ran a little experiment.  (I'm sure we've done this in the
> past.) I took my current hammie.db (153685 keys, no hapaxes, the
> result of processing 11,000+ hams and 8,000+ spams) and converted it
> to a pickle using dbExpImp.  Startup time is dramatically different:

Of course.

>     % time python -c 'import pickle ; db = pickle.load(open("hammie.pck"))'
>
>     real    0m32.193s
>     user    0m22.850s
>     sys     0m0.430s
>     % time python -c 'import cPickle ; db = cPickle.load(open("hammie.pck"))'
>
>     real    0m5.650s
>     user    0m3.720s
>     sys     0m0.350s
>     % time python -c 'import shelve ; db = shelve.open("hammie.db")'
>
>     real    0m0.155s
>     user    0m0.050s
>     sys     0m0.050s
>
> This is not to imply that my huge database is typical or that my
> usage of hammiefilter is either.  Using pickles for moderately sized
> training databases would probably work, regardless of the
> application.  With long-running SB apps like the Outlook plugin or
> pop3proxy, pickles are probably the way to go.  (Maybe it's time to
> give up on hammiefilter altogether.)
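
The comparison above can be reproduced in-process with something like the following. This is a rough sketch on synthetic data, not Skip's actual hammie.db: `build_sample`, `time_call`, and `compare_startup` are hypothetical names, the token records are made up, and it uses modern Python's `pickle` and `shelve` (there is no separate cPickle there). Loading a pickle reads every record up front, while `shelve.open` only opens the underlying database file, which is why the startup times differ so much.

```python
import os
import pickle
import shelve
import tempfile
import time


def build_sample(n):
    """Make a synthetic token -> (spamcount, hamcount) style dict."""
    return {"token%d" % i: (i, i + 1) for i in range(n)}


def time_call(func):
    """Return (result, elapsed seconds) for a zero-argument callable."""
    start = time.perf_counter()
    result = func()
    return result, time.perf_counter() - start


def compare_startup(n=10000):
    """Store the same data as a pickle and as a shelf, then time opening each."""
    workdir = tempfile.mkdtemp()
    data = build_sample(n)

    pck_path = os.path.join(workdir, "sample.pck")
    with open(pck_path, "wb") as f:
        pickle.dump(data, f)

    shelf_path = os.path.join(workdir, "sample_shelf")
    with shelve.open(shelf_path) as shelf:
        shelf.update(data)

    def load_whole_pickle():
        with open(pck_path, "rb") as f:
            return pickle.load(f)

    # The pickle load materializes every record in memory.
    loaded, pickle_secs = time_call(load_whole_pickle)
    # Opening the shelf reads no records until a key is requested.
    shelf, shelve_secs = time_call(lambda: shelve.open(shelf_path))
    try:
        one_record = shelf["token0"]
    finally:
        shelf.close()
    return len(loaded), one_record, pickle_secs, shelve_secs
```

Calling `compare_startup()` returns the number of records loaded from the pickle, one record fetched from the shelf, and the two elapsed times in seconds. The absolute numbers depend on the machine and on which dbm backend `shelve` picks.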

I don't know about hammiefilter (haven't used it).  I'll remind everyone that the
original spambayes design was done with the expectation that the "big dict"
would eventually be replaced by a BTree stored in ZODB.  That's still a
nearly perfect database for spambayes, although only Jeremy pursued it (I
continue to feel guilty about it, though <wink>).



