[spambayes-dev] Spontaneous training in Outlook addin?

Tim Peters tim.one at comcast.net
Sun Jul 27 22:43:17 EDT 2003


[Tim describes some oddities, Mark Hammond thinks hard]

I'm afraid I can't make time to help now.  I switched to bsddb3 in the hopes
I could provoke a corrupt database, or one of the infamous assertion errors,
since I've had more luck than most in fixing stuff like that.  So I've been
paying detailed attention to everything the addin does.  I haven't done that
with the pickled dict classifier in a long time, so I can't be sure any of
this is related to switching data stores.

...

>> Message 'Re: [Python] Re: [OT] On the TimBot' had a Spam
>> classification of 'No'

> So here, we are seeing the OnItemAdd() for the item.  We have decided
> it truly is "new", so we score it.

Or, I guess, the wrong subject line is getting associated with a truly new
item.

>> Training on message 'Re: [Python] Re: [OT] On the TimBot' -
>> trained as good

> So here it appears we are seeing the event again!  Correctly (given
> the log), we detect we have previously seen the message, so assume it
> is a d&d move, and therefore we do an incremental train.

Ah!  That sheds light.  No understanding, but less dark <wink>.
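
For anyone following along, here is a minimal sketch of the decision Mark
describes, to show why a repeated "new item" event ends up as training.  It
is not the addin's actual code: handle_item_add, seen_keys, and dest_is_ham
are hypothetical names, and the key and subject are toy data.

```python
# Hypothetical sketch of the "new item" decision described above.
# None of these names come from the addin; the data is made up.

def handle_item_add(seen_keys, search_key, subject, dest_is_ham=True):
    """Return the log line a handler like this would emit for one event."""
    if search_key not in seen_keys:
        # Never seen this key: treat the message as truly new and score it.
        seen_keys.add(search_key)
        return "scored: %s" % subject
    # Seen before: assume a drag-and-drop move and train incrementally,
    # using the destination folder to pick the label.
    label = "good" if dest_is_ham else "spam"
    return "trained as %s: %s" % (label, subject)

seen = set()
events = [
    (b"\x01", "Re: [OT] On the TimBot"),  # first event: looks new
    (b"\x01", "Re: [OT] On the TimBot"),  # unexplained second event
]
log = [handle_item_add(seen, key, subj) for key, subj in events]
print(log)
```

With logic of this shape, a second event for an already-seen key is enough
to trigger training even though nobody moved anything.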

>> This is curious for two reasons:  (1) I never told spambayes to do
>> anything with the TimBot message; and, (2) That message is old!  All
>> the other messages it's reporting on did arrive in this Outlook
>> session, but the TimBot message it decided to train on arrived days
>> ago.

> Wow!  How strange.  We have seen *2* unexplained "new item" events
> for this message.  I wonder if it is somehow a *copy* of the message?
> Or somehow Outlook thinks it is - or something.

Can't say yet, but I'll keep my eyes open.

Mentioning copies may be important to the next one:  because I keep my
training ham in a folder dedicated to that, sometimes I drag a copy of a ham
message into that folder.  So it's possible that I also (but rarely) end up
saving the original there too, or even put more than one copy into it.

> ...
>> Another oddity I never saw when using a pickled dict:  I asked the
>> addin to rebuild the database from scratch.  This gave:
>>
>> """
>> Checked 357 in folder Ham - 354 new entries found.
>> Checked 771 in folder Spam - 771 new entries found.
>> Saving bayes database with 771 spam and 354 good messages ...
>> """
>>
>> There are in fact 357 msgs in my ham training folder (I have only
>> one, off in a separate .pst file holding my ham and spam training
>> data).  Why would it think only 354 of them are new?  Maybe that
>> also casts suspicion on how we're keeping track of messages.

Since then I went back to a pickled dict, and tried this again.  Same
outcome (3 "mystery ham" vanish).  So this one definitely has nothing to do
with bsddb3.

> The addin uses PR_SEARCH_KEY for the message, as the entry ID changes
> when the message is moved.  The only thing I can think of is that
> Outlook is giving the same ID to multiple messages (possibly even
> when they are dupes of the same spam). But this doesn't explain why
> only bsddb sees it.

Luckily, turns out it's good that didn't explain a non-truth <wink>.

> But FWIW, I *have* seen this - just never got a round tuit.  I guess I
> should :)  My first step would be a new sandbox tool that searches
> your store for duplicate search keys.

No rush.  I'll live for at least another year.
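
A rough sketch of what such a duplicate-key search might boil down to,
assuming the search keys have already been read out of the store as
(key, subject) pairs.  Reading PR_SEARCH_KEY through MAPI is deliberately
left out, find_duplicate_keys is a made-up name, and the messages are toy
data, not anything from a real store.

```python
from collections import defaultdict

# Hypothetical sketch: group messages by search key and report any key
# shared by more than one message.  Input is assumed to be (key, subject)
# pairs that some other code has already pulled from the message store.

def find_duplicate_keys(messages):
    by_key = defaultdict(list)
    for search_key, subject in messages:
        by_key[search_key].append(subject)
    # Keep only the keys that map to more than one message.
    return {key: subjects for key, subjects in by_key.items()
            if len(subjects) > 1}

messages = [
    (b"A1", "ham one"),
    (b"B2", "ham two"),
    (b"A1", "ham one (copy)"),  # shares a key with the first message
    (b"C3", "ham three"),
]
dupes = find_duplicate_keys(messages)
unique = len({key for key, _ in messages})
print(len(messages), "checked,", unique, "distinct keys")
print(dupes)
```

If training is tracked per key, any messages sharing a key would count
once, which is one way "checked" could come out larger than "new entries".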


General observations on the switch from pickled dict to bsddb3:  painless;
no killer problems; the addin still works great; the on-disk database is
about twice as big and I don't care; startup time is much faster even though
my database is relatively small; for the first time, I've seen the addin
miss scoring msgs on an otherwise quiet machine when Outlook starts and a
large pile of email comes in (presumably because the startup timing is so
different, and presumably your new timer code will help with that; I'll try
it); time for retrain-from-scratch is much longer; Outlook VM size is
substantially smaller (but not amazingly smaller, presumably because my
database is pretty small).  So no real surprises, just a couple of oddities
I simply may not have noticed before.



