[spambayes-dev] RE: [Spambayes] Amazing sloth
Tim Peters
tim.one at comcast.net
Wed Apr 21 22:01:13 EDT 2004
[Tim, from a while ago]
>> Here's a weird one, w/ Outlook 2000 and the addin from
>> not-so-recent-anymore CVS. I decided to start over from
>> scratch today, so have a new (Berkeley) DB.
[Tony Meyer]
> This is with Outlook 2002 (SP2) and the addin also from
> not-so-recent-anymore CVS. I also started over (see the spambayes-dev
> message) from scratch today with a new (Berkeley) DB. Specifically,
> I just trashed the old database files (while Outlook was closed) and
> started training as things arrived in the unsure folder.
>> It's taking the addin from 4 to 10 seconds to score each(!)
>> message. That's whether it's new incoming email, or via the
>> "Filter messages ..." menu item, or via a single "Show spam
>> clues". It's mind-numbingly slow.
> And, no surprise since you've read this far, I found this too.
> Bizarre.
I should mention that it happened two more times for me after starting over
from scratch, with very few msgs trained on each time (certainly fewer than
50 total). At that point I got a new box with a gigabyte of RAM, and
switched to using a giant pickled dict instead. Much faster scoring, no
problems, but much slower Outlook startup time and incremental training
times.
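A rough illustration of why a pickled dict trades that way (a sketch only, not the SpamBayes storage code; the file name, token names, and value layout are made up for the example): the whole table has to be unpickled before the first lookup, and the whole file is rewritten after even a one-token change, but once loaded every lookup is a plain in-memory dict access.

```python
import os
import pickle
import tempfile


def save_all(path, table):
    # Every save rewrites the entire file, however small the change was.
    with open(path, "wb") as f:
        pickle.dump(table, f, pickle.HIGHEST_PROTOCOL)


def load_all(path):
    # The entire table must be read into memory before any lookup (startup cost).
    with open(path, "rb") as f:
        return pickle.load(f)


# Hypothetical token table: token -> (spam count, ham count).
path = os.path.join(tempfile.mkdtemp(), "tokens.pck")
table = {"token%d" % i: (i, i + 1) for i in range(10000)}
save_all(path, table)

loaded = load_all(path)             # slow part at startup
counts = loaded["token5"]           # fast: ordinary dict lookup
loaded["brand-new-token"] = (1, 0)  # "incremental" training on one token...
save_all(path, loaded)              # ...still rewrites all 10001 entries
```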
>> While a message is being scored, Outlook is unresponsive to
>> keyboard or mouse input, but the process is using very little
>> CPU (typically a fraction of a percent, with very brief
>> spikes). So it's waiting on *something*, but don't know what.
>>
>> Nothing odd in the PythonWin Trace Collector display. Ran
>> scanpst on all the relevant .pst files -- no problems. The
>> sloth persists after restarting Outlook, and after a reboot.
>> No other Outlook operations have slowed, just SpamBayes.
> All of this applies to my experience as well, although I didn't try
> scanpst (I don't know if I have it, and since it didn't do Tim any
> good, it probably wouldn't have helped me anyway).
Whenever you see reference to the "Inbox Repair Tool", it means scanpst.exe.
I'm amazed that MS continues to make this thing so hard to find: .pst files
routinely get corrupted in minor and major ways by Outlook (whether or not
SpamBayes is installed), and scanpst.exe finds at least one problem in my
.pst files every day(!). You have scanpst.exe, but you may have to search
your disk to find it.
> Heh ;). I didn't spend two hours on it, though. I remembered Tim's
> message and so after about 5 minutes just started with new db's again.
>> The sloth went away then, just as mysteriously and
>> dramatically as it appeared. Outlook remained open the entire time:
>>
>> extremely slow
>> retrain on 5 new ham and 5 new spam from scratch
>> zippy again
> I started afresh (from training 1 ham and 9 spam) also, but in the
> same way as before - close Outlook, move aside slow db's and start
> Outlook again. Also zippy once again.
>> So no clues, just bizarre symptoms. If it happens to you,
>> don't be an idiot like I just was: save the .db file before
>> retraining the problem away (it's the only relevant thing I
>> can think of that changed).
> Normally I would have done just that, but I recalled this message (me
> who struggles to remember what I had for dinner last night! <wink>)
> and so have it zipped away for analysis.
>
> So, I offer it up to anyone interested in looking into it, or offer
> myself up to spend time looking into it if someone can suggest ways
> of doing that. I don't really know where to start.
Since I moved to a giant pickled dict, I don't care anymore <0.5 wink>. An
interesting experiment would be to open the saved .db file directly from a
non-SpamBayes Python program, and just time lookups and inserts.
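A minimal sketch of that kind of timing experiment, under some assumptions: time_db_ops is a made-up helper name, and the demo times a plain dict as a stand-in because the bsddb module is no longer in the standard library. With the Python of the day, the object passed in would presumably have come from something like bsddb.hashopen() on the saved file; any object with dict-style access should work the same way here.

```python
import time


def time_db_ops(db, lookup_keys, new_items):
    """Time dict-style lookups and inserts against `db`.

    `db` is any mapping-like object; this helper is hypothetical and is
    not part of SpamBayes.
    """
    start = time.perf_counter()
    hits = 0
    for key in lookup_keys:
        if key in db:
            db[key]  # fetch the value so the read is actually performed
            hits += 1
    lookup_secs = time.perf_counter() - start

    start = time.perf_counter()
    for key, value in new_items:
        db[key] = value
    insert_secs = time.perf_counter() - start

    return {"hits": hits, "lookup_secs": lookup_secs, "insert_secs": insert_secs}


# Stand-in data: a plain dict with byte-string keys and values.
db = {b"token%d" % i: b"1" for i in range(1000)}
lookups = [b"token%d" % i for i in range(0, 2000, 2)]     # half of these miss
inserts = [(b"new%d" % i, b"1") for i in range(1000)]
result = time_db_ops(db, lookups, inserts)
print("%(hits)d hits, lookups %(lookup_secs).4fs, inserts %(insert_secs).4fs" % result)
```

If single lookups against the slow file take seconds here too, with no Outlook in the picture, that would point at the database layer rather than the addin.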
There was a disturbing Python bug report against bsddb that I closed as
hopeless:
http://www.python.org/sf/881522
This was about a huge slowdown in shelve after several thousands of keys had
been added. There were strong hints that the huge slowdown was specific to
the combination of:
    "a modern" bsddb (after the ancient 1.85)
    Windows
    the hash flavor of bsddb
There were also hints that the BTree flavor of bsddb was faster than the
hash flavor, independent of the mystery-slowdown in the hash flavor.
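One way to watch for that sort of slowdown is to add keys to a shelve in fixed-size batches and time each batch; a healthy database should show roughly flat batch times as it grows. This is only a sketch under assumptions: time_insert_batches is a made-up name, the batch sizes are arbitrary, and on a current Python shelve sits on whatever dbm backend is available, not the bsddb hash flavor from the bug report, so it shows the measuring technique rather than reproducing the bug.

```python
import os
import shelve
import tempfile
import time


def time_insert_batches(path, batches=4, batch_size=500):
    """Insert keys into a shelve in batches; return seconds taken per batch."""
    timings = []
    with shelve.open(path) as db:
        for b in range(batches):
            start = time.perf_counter()
            for i in range(batch_size):
                n = b * batch_size + i
                db["key%d" % n] = n
            timings.append(time.perf_counter() - start)
    return timings


path = os.path.join(tempfile.mkdtemp(), "growth-test")
timings = time_insert_batches(path)
for b, secs in enumerate(timings):
    print("batch %d: %.4fs" % (b, secs))
```

A batch that suddenly takes many times longer than the earlier ones would be the thing to look for.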
Since we experienced Amazing Sloth under different versions of Outlook, and
very different OSes, my top guess has to be that the fault is in the dbhash
flavor of bsddb.