[Spambayes] Most Recently Received Email Not Filtered

Brown, Jim brown at terralign.com
Fri Feb 13 00:07:14 EST 2004

The discussion taking place under the subject "Identified as ham, but
sent to spam folder" referred to the problem where the most recently
received email is not filtered. I did some fairly extensive (although
amateurish) checking on this issue, and added the following to bug
report #793830:


Date: 2004-02-09 06:31
Sender: brown2611
Logged In: YES
user_id=963745


I believe this is the same as bug #876281.


As far as I can tell, having only started looking at Python and the
SpamBayes Windows code yesterday, here is why this happens:


1. When SB starts, it (eventually) invokes
BayesManager.EnsureOutlookFieldsForFolder() for each folder. This method
finds the first MailItem in the specified folder and looks for the "Spam"
UserProperty attached to that MailItem. If the MailItem does not have
the "Spam" UserProperty, SB adds the UserProperty to the MailItem,
causing the MailItem to have a Spam score of 0.


2. In trying to process missed messages, SB invokes
MAPIMsgStoreFolder.GetNewUnscoredMessageGenerator(), which excludes any
MailItem that already has the "Spam" UserProperty from processing.


Voila. The item most recently received in a folder while SB was not
running is automatically assigned a score of 0 and not filtered.
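
To make the interaction between steps 1 and 2 concrete, here is a toy
model in Python. It is not SpamBayes code: plain dicts stand in for
Outlook MailItems, a list (newest first) stands in for a folder, and the
function names ensure_field_for_folder() and new_unscored_messages() are
made up for illustration. It only assumes the behaviour described above.

```python
# Toy model only: dicts stand in for MailItems, a list for a folder.
# The function names are hypothetical, not the real SpamBayes methods.

def ensure_field_for_folder(folder):
    """Mimic step 1: make sure the first item carries a "Spam" property."""
    if folder and "Spam" not in folder[0]:
        folder[0]["Spam"] = 0  # the default looks exactly like a ham score


def new_unscored_messages(folder):
    """Mimic step 2: yield only the items that have no "Spam" property."""
    for item in folder:
        if "Spam" not in item:
            yield item


# Newest first: two messages arrived while the filter was not running.
inbox = [
    {"subject": "newest"},
    {"subject": "older"},
    {"subject": "scored earlier", "Spam": 97},
]

ensure_field_for_folder(inbox)
missed = [item["subject"] for item in new_unscored_messages(inbox)]
print(missed)  # ['older']
```

In this model the newest message never reaches the scorer, because the
field check itself has already marked it with a score of 0.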


Apparently, there isn't a clean way to detect if a folder has a
particular user defined field. So, the only way to check for the
presence of the user defined field in the folder is to check the items in
the folder. SB (correctly) assumes that adding the UserProperty to the
first MailItem will force the creation of the user defined field in the
folder. However, this is not a benign act.


Possible solutions:


* Don't force the creation of the user defined field until SB has an
actual score to store. However, I fear there may be a great number of
places in the code that assume the Spam field already exists. For
example, GetNewUnscoredMessageGenerator() certainly makes this
assumption. The error resulting from the lack of the Spam field could be
trapped, but I don't know the code well enough to find all the places
where the absence of the field might be a problem. Nonetheless, this
seems to be the correct solution to me.


* EnsureOutlookFieldsForFolder() could check more than just the first
MailItem in the folder. However, this doesn't avoid the problem if every
message in the folder is a missed message.


* If SB is going to force the creation of the Spam field, go ahead and
filter the message. Aside from not being terribly clean, it's not clear
to me that enough of the code has been initialized by this point to
filter a message.


* If SB is going to force the creation of the Spam field, initialize it
to a value that is easily detected as unscored, for example -1. However,
I'm not confident that the code doesn't depend on 0 <= Spam <= 100. In
addition, I don't know how many places in the code would have to be
changed to recognize this value.


* Change the minimum score to 0.0001, or the like, and detect a score of
0 as an unscored message. This is kludgey, and one would still have to
find all of the places where something with a ham-like score (or any
score) is excluded from further filtering (e.g., addin.ProcessMessage()).


* Since OutlookAddin.ProcessMissedMessages() only _seems_ to be invoked
at startup, it could be modified to always process the first MailItem if
the item is unread and has a score of exactly zero. Again, kludgey, but
at least the kludge is confined to a single place. On the other hand, it
looks to me like SB has already hooked into the folders it is watching
at this point. If SB is filtering in the background, I suspect that any
mail received before the "processing start delay" expires would bump the
improperly flagged message out of the first position. Hmmm.


* One could treat anything with a score of 0 as an unscored message, but
it isn't really desirable to rescore all of those messages, since the
majority of them have presumably already been filtered correctly.
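
As a rough illustration of the sentinel option above (initializing the
field to -1), here is a toy sketch in Python. Dicts stand in for
MailItems, and the names UNSCORED, ensure_field_for_folder() and
new_unscored_messages() are hypothetical. It does not address the open
question of whether other code depends on 0 <= Spam <= 100.

```python
# Toy sketch only: dicts stand in for MailItems, a list for a folder.
# UNSCORED and both function names are hypothetical, not SpamBayes code.

UNSCORED = -1  # sentinel outside the assumed 0..100 score range


def ensure_field_for_folder(folder):
    """Force the field into existence without faking a real score."""
    if folder and "Spam" not in folder[0]:
        folder[0]["Spam"] = UNSCORED


def new_unscored_messages(folder):
    """Yield items with no "Spam" property or with the sentinel value."""
    for item in folder:
        if item.get("Spam", UNSCORED) == UNSCORED:
            yield item


# Newest first: two messages arrived while the filter was not running.
inbox = [
    {"subject": "newest"},
    {"subject": "older"},
    {"subject": "real ham", "Spam": 0},
]

ensure_field_for_folder(inbox)
missed = [item["subject"] for item in new_unscored_messages(inbox)]
print(missed)  # ['newest', 'older']
```

Here a genuine score of 0 is left alone, while the item touched by the
field check is still picked up as unscored.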


Any feedback from someone who actually knows the code?




