[spambayes-dev] Spontaneous training in Outlook addin?
mhammond at skippinet.com.au
Mon Jul 28 11:29:43 EDT 2003
> One thing I've noticed, both at work and at home, is that the
> Trace Collector window occasionally shows an instance of "spontaneous
> training": a single msg that's never been moved (still sitting in my
> inbox or Unsure folder) gets trained as Ham or Spam "for no reason at
> all".
Strange. Interestingly, this isn't part of "missed message" processing.
Also, thankfully you haven't got my new timer code yet <wink>.
So this means that the folder in question (which I presume is your Inbox) is
getting an "OnItemAdd" event for this item. This is our main entry point.
This ends up calling ProcessMessage, which then attempts to determine if we
have previously seen the message. If we have, we train.
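
To make that flow concrete, here is a minimal sketch of it. This is a toy model, not the addin's actual code: the class name, the `seen` set standing in for the message id database, and the `classify` callable are all assumptions made for illustration. It shows why a second "new item" event for the same message would produce a train rather than a score.

```python
class AddinSketch:
    """Toy model of the OnItemAdd -> ProcessMessage flow (hypothetical names)."""

    def __init__(self, classify):
        self.classify = classify  # hypothetical scoring callable
        self.seen = set()         # stand-in for the message id database
        self.log = []             # stand-in for the Trace Collector output

    def on_item_add(self, msg_key, subject, folder_is_spam=False):
        # Main entry point: fired when an item lands in a watched folder.
        return self.process_message(msg_key, subject, folder_is_spam)

    def process_message(self, msg_key, subject, folder_is_spam):
        if msg_key in self.seen:
            # Seen before: assume the user moved it, so train incrementally.
            label = "spam" if folder_is_spam else "good"
            self.log.append(
                "Training on message %r - trained as %s" % (subject, label)
            )
            return "trained"
        # Truly new: remember the key and score the message.
        self.seen.add(msg_key)
        score = self.classify(subject)
        self.log.append("Scored message %r: %s" % (subject, score))
        return "scored"


addin = AddinSketch(classify=lambda subject: 0.5)
first = addin.on_item_add("key-1", "On the TimBot")
second = addin.on_item_add("key-1", "On the TimBot")  # event fires again
```

In this model nothing distinguishes a genuine move from a repeated event: the only evidence is that the key is already known, so a duplicate event is indistinguishable from the user filing the message.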
> Pay attention to the next one:
> Message 'Re: [Python] Re: [OT] On the TimBot' had a Spam
> classification of
So here, we are seeing the OnItemAdd() for the item. We have decided it
truly is "new", so we score it.
<snip a couple of more msgs being scored>
> And then:
> Training on message 'Re: [Python] Re: [OT] On the TimBot' -
> trained as good
So here it appears we are seeing the event again! Correctly (given the
log), we detect that we have previously seen the message, so we assume it
is a drag-and-drop move, and therefore we do an incremental train.
> This is curious for two reasons: (1) I never told spambayes to do
> anything with the TimBot message; and, (2) That message is old! All the
> other messages it's reporting on did arrive in this Outlook session, but
> the TimBot message it decided to train on arrived days ago.
Wow! How strange. We have seen *2* unexplained "new item" events for this
message. I wonder if it is somehow a *copy* of the message? Or somehow
Outlook thinks it is - or something.
> any other reports of it. Anyone else? My first suspicion was that we're
> doing something wrong in the bsddb3 version of the message id database,
> but that wouldn't (AFAIK) explain spontaneous training.
I've never seen it.
> Another oddity I never saw when using a pickled dict: I asked the addin
> to rebuild the database from scratch. This gave:
> Checked 357 in folder Ham - 354 new entries found.
> Checked 771 in folder Spam - 771 new entries found.
> Saving bayes database with 771 spam and 354 good messages
> There are in fact 357 msgs in my ham training folder (I have only one,
> off in a separate .pst file holding my ham and spam training data). Why
> would it think only 354 of them are new? Maybe that also casts suspicion
> on how we're keeping track of messages.
The addin uses PR_SEARCH_KEY for the message, as the entry ID changes when
the message is moved. The only thing I can think of is that Outlook is
giving the same ID to multiple messages (possibly even when they are dupes
of the same spam). But this doesn't explain why only bsddb sees it.
But FWIW, I *have* seen this - just never got a round tuit. I guess I
should :) My first step would be a new sandbox tool that searches your
store for duplicate search keys.
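
As a rough idea of what such a tool would do once the keys are in hand, here is a minimal sketch. It deliberately leaves out the Outlook/MAPI side: the function name is hypothetical, and it assumes the caller has already pulled a (search key, subject) pair for each message in the store.

```python
from collections import defaultdict


def find_duplicate_keys(messages):
    """Group messages by search key and return only keys seen more than once.

    `messages` is assumed to be an iterable of (search_key, subject) pairs
    already extracted from the store; the extraction itself is not shown.
    """
    groups = defaultdict(list)
    for key, subject in messages:
        groups[key].append(subject)
    return {key: subjects for key, subjects in groups.items()
            if len(subjects) > 1}


# Made-up sample data: two messages share a key, one is unique.
sample = [
    (b"\x01", "offer A"),
    (b"\x02", "hello"),
    (b"\x01", "offer A (copy)"),
]
dupes = find_duplicate_keys(sample)
```

If the duplicate-key theory holds, a rebuild that tracks messages by search key would count one "new entry" per distinct key, so 357 checked with 354 new would correspond to three messages whose keys had already been recorded. That is a guess until a tool like this is run against a real store.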
> I'm also disturbed that the 'Animal Perversion' msg got rated as spam,
> although I probably shouldn't admit that <wink>.
Just keep training on it - you will get there <wink>