[Spambayes] IMAP filtering

Tony Meyer tameyer at ihug.co.nz
Fri Apr 16 18:47:23 EDT 2004


> 1. due to some unknown configuration bug in the Exchange
> server attachments from other Exchange users cannot be read
> via IMAP or POP3 (attachments sent via SMTP can be read) so
> deleting and re-posting these messages
> would have the effect of stripping the attachments from them

Note that if a message is classified as unsure/spam, then it also gets
recreated (IMAP doesn't provide any means to move a message), so you'd lose
this information then.  I doubt this would matter for true spam, but for
false positives or ham unsures, this could be problematic (although note
that the original isn't removed, just flagged for deletion).

> 2. I am extremely nervous about deleting, modifying, and re-posting
> messages that Exchange uses for special purposes (calendar scheduling
> messages are a prime example); while they show up as mail
> messages, they really are slightly different

With your setup, do these appear in the same folder as mail messages?  Here,
for example, all my Exchange folders have either mail *or* non-mail, and so
I'd just filter those containing mail.  If they're scattered through the
same folders, though, then this could be a problem.

Actually, even without modifying the messages, this would seem to pose a
problem, because the filter will try to classify these messages.  I have no
idea how a scheduling message would be classified, but it's possible that it
would be non-ham and end up moving to unsure/spam, which is probably not a
good thing.  You might have to add some sort of code that identifies these
messages and skips them.
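
As a rough illustration of the "identify and skip" idea above, here is a
minimal sketch.  It is not code from the filter: the function name is
hypothetical, and it assumes that scheduling items can be recognised over
IMAP by a text/calendar MIME part, which would need checking against a real
Exchange server.

```python
from email import message_from_bytes


def looks_like_calendar_item(raw_message: bytes) -> bool:
    """Guess whether a raw RFC 2822 message is a scheduling item.

    Assumption (not verified against Exchange): scheduling messages
    contain at least one MIME part of type text/calendar.
    """
    msg = message_from_bytes(raw_message)
    # walk() yields the message itself and then every nested MIME part.
    for part in msg.walk():
        if part.get_content_type() == "text/calendar":
            return True
    return False
```

A filter loop could call this before classifying and simply leave matching
messages untouched.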

> the fix that I am thinking of to resolve this would be to 
> change how the IMAP filter tracks the messages it has processed.

This is certainly something that can/should be done at some point.
Unfortunately, while the IMAP filter is used by a number of people, there
isn't currently anyone who is taking a proactive role in developing it.  In
fact, there never has been - Tim Stone & I initially wrote it to alleviate
the frequent requests for such a filter.  I'm happy to maintain it (i.e. fix
bugs and do simple improvements), but since I don't actually use it for
day-to-day mail, I just can't find the time to take on a more active role
with it.  I'd certainly be happy to pass the torch on to someone else, but
no-one has stepped forward so far.

The result is that non-simple changes are unlikely to occur unless we get
patches (as I'm hoping you'll offer), and, especially important with IMAP,
people testing the changes.  I'd also want to hold off checking any patches
in until after 1.0 is out, since the current system is working reasonably
well (but 1.0 shouldn't be too far off now that we finally have a beta out).

> Instead of modifying the message itself, if the filter tracked the
> highest message number that it has processed it can process only
> messages newer than that (the IMAP message ID is supposed to grow
> larger with time).

This isn't the ideal system, though.  The IMAP spec doesn't guarantee that
the UID will continue to grow larger with time.  At pretty much any point,
the server can decide to change the UID to anything it likes, as long as it
changes the folder's id at the same time.  This could be solved by using
some sort of combination of tracking the folder id and UID, but the folder
id isn't guaranteed to behave in any reliable fashion, either.  AFAICT (and
I and other people have gone through the RFC many times) there really isn't
any way to get IMAP to produce a unique, constant, id for each message.

Of course, any given IMAP server may actually do this, and many do.  But
some don't, and the idea with the filter is to support as many flavours of
IMAP as possible, which means that this isn't the way to go.  A similar
method is to store a custom flag with the appropriate information (this is
really the ideal way to go), except that not all servers support custom
flags.  For you, I suspect that Exchange does support custom flags
(instinct, not knowledge), so this might be a way for you to go.
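
For servers that do keep UIDs stable, the "folder id plus UID" tracking
discussed above might look something like the following sketch.  The
function names and the shape of the state dictionary are made up for
illustration; the folder id here is what the RFC calls UIDVALIDITY.  Note
the weakness already described: whenever the folder id changes, every
message looks new again.

```python
def uids_to_process(state, uidvalidity, uids):
    """Return the UIDs that have not been processed yet, in order.

    state is a dict that may hold 'uidvalidity' and 'last_uid' from an
    earlier run (hypothetical layout, for illustration only).
    """
    if state.get("uidvalidity") != uidvalidity:
        # The server has renumbered the folder, so stored UIDs are
        # meaningless and every message has to be treated as new.
        return sorted(uids)
    last_uid = state.get("last_uid", 0)
    return sorted(uid for uid in uids if uid > last_uid)


def record_progress(state, uidvalidity, processed_uids):
    """Remember the folder id and the highest UID handled so far."""
    if state.get("uidvalidity") != uidvalidity:
        state["uidvalidity"] = uidvalidity
        state["last_uid"] = 0
    if processed_uids:
        state["last_uid"] = max(state.get("last_uid", 0), max(processed_uids))
```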

From past discussion, the best scheme that I've seen so far is:
  
  1.  If the message has a Message-Id header, then use that as the id for
the message.  This should be unique, will certainly be constant, and simple
checks indicate that it's present in most messages.

  1(a).  However, from other work with Exchange, messages from other
Exchange users may very well *not* have a Message-ID header if they're still
sitting on the server; I'm not sure - all the Exchange work I've done has
involved an Outlook client.  If they don't, then they might have some sort
of Exchange id that would work just as well; it'd be easy enough to check.

  2.  Otherwise, get a checksum for the message (using one of the routines
in the standard library) and use that as the id for the message.  This is
most likely to be unique (especially if you include the headers, although
you could have duplicates), and should be constant (because IMAP doesn't
allow message text/headers to be changed).
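
The two steps above can be sketched roughly as follows.  This is only an
illustration, not code from the filter: the function name and the "mid:" /
"sha1:" prefixes are invented here, and SHA-1 is just one of the checksum
routines in the standard library that could be used.

```python
import hashlib
from email import message_from_bytes


def stable_message_id(raw_message: bytes) -> str:
    """Return an id for a message that does not depend on IMAP UIDs."""
    msg = message_from_bytes(raw_message)
    # Step 1: prefer the Message-Id header (lookup is case-insensitive).
    message_id = msg.get("Message-Id")
    if message_id is not None and str(message_id).strip():
        return "mid:" + str(message_id).strip()
    # Step 2: otherwise fall back to a checksum of headers plus body.
    # Identical copies of a message will share the same id.
    return "sha1:" + hashlib.sha1(raw_message).hexdigest()
```

The prefixes only keep the two kinds of id from colliding; the Exchange
case in 1(a) would need an extra branch once it's known what header, if
any, is available there.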

If you are interested in doing this, it might be worth reading through the
messages in the spambayes-dev archives that discuss this.  Googling for
"site:mail.python.org spambayes-dev imap" will get them (there aren't a lot
of spambayes-dev messages about the imap filter, so there shouldn't be much
else).  We'd certainly be interested in a patch.

> As an additional optimization, instead of running every x min as it
> currently does the filter could register itself with the server for
> specific mailboxes and have the server notify it when new
> mail has arrived and process it immediately (this also can produce
> less server load and network traffic than frequent polling for new
> messages, a win for both load and responsiveness)

I presume this is something that Exchange lets you do?  AFAIK this isn't
something that regular IMAP4 can do, otherwise this would indeed be a better
way to do it.  If a patch to allow this didn't require too much refactoring
of the code, I wouldn't have a problem with including this as an option, for
those people in your situation.  Unless this is something that strict IMAP4
can handle, though, I wouldn't want it in the main distribution in any other
form.  In any case, there's certainly no reason why you
couldn't run a version patched like this yourself.

=Tony Meyer

---
Please always include the list (spambayes at python.org) in your replies
(reply-all), and please don't send me personal mail about SpamBayes. This
way, you get everyone's help, and avoid a lack of replies when I'm busy.



