the mailman outgoing mail queue has a number of concurrency-control issues.
here's an overview of the processes involved in the queue, from what I can tell so far:
before a delivery is attempted, the message is queued. This is a good idea because if there is an unforeseen exception that kills the delivery process, we want the data already on disk so that it can be delivered at a later time.
each time a delivery is requested via the contact_transport script, the entire mail queue is rerun.
when there are subscribers that belong to more than one domain, there are potentially multiple contact_transport processes running concurrently per post, as per the forking in the deliver script.
anytime there is more than one contact_transport script running, there is the possibility that one of those processes is in the middle of delivering a message (already queued and not yet dequeued) while another one reads that queue entry, assumes that since it's in the queue it still needs to be delivered, and attempts the delivery itself. both will probably deliver the same message (producing duplicates), and whichever of the two processes finishes the delivery first unlinks the file; the remaining one then fails to do so, producing exceptions in logs/errors like this:
Oct 02 12:01:23 1998 contact_transport: Traceback (innermost last):
contact_transport:   File "/home/mailman/scripts/contact_transport", line 60, in ?
contact_transport:     OutgoingQueue.processQueue()
contact_transport:   File "/home/mailman/Mailman/OutgoingQueue.py", line 38, in processQueue
contact_transport:     Utils.TrySMTPDelivery(recip,sender,text,full_fname)
contact_transport:   File "/home/mailman/Mailman/Utils.py", line 226, in TrySMTPDelivery
contact_transport:     OutgoingQueue.dequeueMessage(queue_entry)
contact_transport:   File "/home/mailman/Mailman/OutgoingQueue.py", line 25, in dequeueMessage
contact_transport:     os.unlink(msg)
contact_transport: os.error: (2, 'No such file or directory')
the same problem can occur when a run_queue process runs concurrently with another run_queue process or a contact_transport process and produces the same traceback from a different top level:
Oct 02 09:10:16 1998 smtplib: Traceback (innermost last):
smtplib:   File "/home/mailman/cron/run_queue", line 31, in ?
smtplib:     OutgoingQueue.processQueue()
smtplib:   File "/home/mailman/Mailman/OutgoingQueue.py", line 38, in processQueue
smtplib:     Utils.TrySMTPDelivery(recip,sender,text,full_fname)
smtplib:   File "/home/mailman/Mailman/Utils.py", line 226, in TrySMTPDelivery
smtplib:     OutgoingQueue.dequeueMessage(queue_entry)
smtplib:   File "/home/mailman/Mailman/OutgoingQueue.py", line 25, in dequeueMessage
smtplib:     os.unlink(msg)
smtplib: os.error: (2, 'No such file or directory')
In addition to this, there are permissions problems that can arise: when a contact_transport script is called, it is called from the mail process, and runs with group id mailman and whatever uid the calling process hands to it. it creates a queue file owned by the uid of that process. later, when run_queue is run, it runs with mailman's uid and consequently cannot do many things to the queue file, like unlink it. If the file isn't written with group read permissions, it can't read it either, and you get tracebacks like this:
Oct 02 12:50:03 1998 smtplib: Traceback (innermost last):
smtplib:   File "/home/mailman/cron/run_queue", line 31, in ?
smtplib:     OutgoingQueue.processQueue()
smtplib:   File "/home/mailman/Mailman/OutgoingQueue.py", line 34, in processQueue
smtplib:     f = open(full_fname,"r")
smtplib: IOError: (13, 'Permission denied')
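to make the failure modes concrete, the dequeue path boils down to roughly the following (a paraphrase for illustration, not the literal OutgoingQueue.py code). nothing prevents a run_queue and a contact_transport process from both reaching it with the same queue entry, and nothing guarantees the run_queue uid can even open the entry:

import os

def process_queue_entry(full_fname, deliver):
    # both a contact_transport run and a run_queue run can get here
    # with the same queue file
    f = open(full_fname, "r")        # IOError (13) if not group-readable
    text = f.read()
    f.close()
    deliver(text)                    # both may deliver -> duplicate mail
    os.unlink(full_fname)            # the slower process gets os.error (2)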
So we need a mail queuing architecture that will address all these issues.
scott
On Fri, Oct 02, 1998 at 03:55:30PM -0400, Scott wrote:
| the mailman outgoing mail queue has a number of concurrency-control issues.
[see previous post]
| So we need a mail queuing architecture that will address all these
| issues.
Here's an idea:
1) we alter contact_transport so that it does not try to process the queue anymore. it would only deal with the delivery at hand.

2) we create a 2-part mail queue inside mm_cfg.DATA_DIR/mqueue/{active,deferred}. when we enqueue a message for delivery, we put it in mqueue/active/<qfilename>. If the delivery succeeds, we unlink the file. If it fails, we rename the file to mqueue/deferred/<qfilename>. All mail queue files in active/ will be handled by a single process under the current delivery mechanism, so no concurrency control is necessary for active/ queue files. this would involve changes to TrySMTPDelivery, and the installation procedure.

3) we alter OutgoingQueue.enqueueMessage so that it can handle coming up with unique filenames under this 2-part mail queue mechanism.

4) we alter OutgoingQueue.processQueue so that it creates a site-wide queue_run lock file to prevent more than one queue run from happening at a time. this process will also check the active/ queue files for files whose modification/creation time is older than some configurable amount of time (on the order of 1hr-1day). For each of these files, it will rename them to the deferred/ part of the queue before proceeding to process them. These 'stale' queue files would only come about as a result of system crashes or memory errors or similar serious system-related and unpredictable errors that can happen in the middle of an smtp transaction. (a rough sketch of the whole scheme follows this list.)
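here's roughly what i have in mind for 1) through 4). names like enqueue_message, deliver_one and process_queue, the lock file handling, and the stale-age value are illustrative only; the real OutgoingQueue.py / Utils.py changes will differ in detail:

import errno
import os
import time

MQUEUE_DIR   = "/home/mailman/data/mqueue"          # stands in for mm_cfg.DATA_DIR + "/mqueue"
ACTIVE_DIR   = os.path.join(MQUEUE_DIR, "active")
DEFERRED_DIR = os.path.join(MQUEUE_DIR, "deferred")
LOCK_FILE    = os.path.join(MQUEUE_DIR, "queue_run.lock")
STALE_AGE    = 3600                                  # seconds; configurable, 1hr-1day

def enqueue_message(text):
    # unique name per entry; the real code can keep using the tempfile module
    fname = os.path.join(ACTIVE_DIR, "q%d.%d" % (time.time(), os.getpid()))
    fp = open(fname, "w")
    fp.write(text)
    fp.close()
    return fname

def deliver_one(entry, send_func):
    # send_func stands in for the actual smtp delivery done in TrySMTPDelivery
    try:
        send_func(entry)
    except Exception:
        # failed delivery: move it aside so only the next queue run retries it
        os.rename(entry, os.path.join(DEFERRED_DIR, os.path.basename(entry)))
    else:
        os.unlink(entry)

def process_queue(send_func):
    # site-wide lock so only one queue run happens at a time
    try:
        fd = os.open(LOCK_FILE, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except OSError as e:
        if e.errno == errno.EEXIST:
            return                      # another queue run is in progress
        raise
    try:
        now = time.time()
        # demote stale active entries (left over from crashes mid-transaction)
        for name in os.listdir(ACTIVE_DIR):
            path = os.path.join(ACTIVE_DIR, name)
            if now - os.path.getmtime(path) > STALE_AGE:
                os.rename(path, os.path.join(DEFERRED_DIR, name))
        # retry everything that has been deferred
        for name in os.listdir(DEFERRED_DIR):
            deliver_one(os.path.join(DEFERRED_DIR, name), send_func)
    finally:
        os.close(fd)
        os.unlink(LOCK_FILE)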
the above scheme works only if the run_queue uid owns all of the queue files, or is root. I believe that it is possible for queue files to be owned by both the uid of the cgi and the uid of the local mail delivery agent. If this is the case, then either run_queue will have to be run as root, or all processes creating a queue file will have to setuid to mailman before creating the file. Are there any preferences on which of these two approaches would be best?
the above scheme should not affect delivery rates much at all, since the TrySMTP process would be the same except that it would have to add a rename() operation if delivery failed. There would be no contention over locks for most deliveries. deliveries that are deferred would be handled sequentially, but even that should be ok since each message in the queue can have up to some very large number of recipients. (on an unrelated note - has anyone bumped up against rcpt limits with mailers yet?)
if there aren't any concerns over this approach, i'll go ahead and code it -- starting monday. should take a day or two to code and test.
scott
Scott, from what i know of the queue mechanism and error notices, i think your assessment of the situation is a good one (and a wise move, to take a comprehensive look at what's going on).
I think one thing that _might_ alleviate some of the difficulty concerns the permissions issue. I know that in many unices you can use setgid directories to ensure that files created in the dir inherit the group id of the directory. By setting the group id appropriately to that of the process that will be servicing stuff left on the queue, and making sure that the processes putting stuff in the queue enable group permissions on the files, the queue-servicing process has access if the placing process fails to do the send, and the placing processes have access by virtue of owning the files.
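Something like the following is the mechanics I have in mind (illustration only; the directory path and the 'mailman' group name are assumptions, and in practice the install procedure would do the equivalent chgrp/chmod once):

import grp
import os
import stat

# illustration only: assumed queue dir path and 'mailman' group name
qdir = "/home/mailman/data/mqueue"
mailman_gid = grp.getgrnam("mailman").gr_gid

# one-time setup: group 'mailman' owns the dir, and the setgid bit makes
# new files in it inherit that group (on unices where this behavior holds)
os.chown(qdir, -1, mailman_gid)
os.chmod(qdir, os.stat(qdir).st_mode | stat.S_ISGID)

# when any process drops an entry on the queue, it enables group permissions
# so the queue-servicing process can read (and later unlink) it:
entry = os.path.join(qdir, "q12345")
fp = open(entry, "w")
fp.write("queued message text\n")
fp.close()
os.chmod(entry, 0o660)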
I've used this setgid directory mechanism for many things with very good results - my only uncertainty is whether this behavior - that a setgid directory forces files created in the directory to assume the same group ownership as that of the directory - is common across all unices. Anybody know of prevalent contemporary unix systems where it doesn't hold?
Ken klm@python.org
i have coded the changes to the queueing mechanism, and checked in the relevant files (contact_transport, Mailman/Utils.py, and Mailman/OutgoingQueue.py) to the cvs tree.
Just wanted to note a couple of things that differ slightly between what i coded and the plan below:
a closer look at the code prompted me to use file metadata to denote whether a q entry has been deferred, because it allowed the convenience of continuing to use the tempfile module. Originally, i tried setting the sticky bit on files that are in an active state, and found that on some OS's in some conditions it wouldn't let certain users set this, so i decided to use the setuid bit instead.
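for concreteness, the marking amounts to something like this (a sketch only, with assumed semantics: the setuid bit on a q entry means it is in the active state; the helpers in the checked-in OutgoingQueue.py differ in detail):

import os
import stat

def mark_active(path):
    mode = stat.S_IMODE(os.stat(path).st_mode)
    os.chmod(path, mode | stat.S_ISUID)

def mark_deferred(path):
    mode = stat.S_IMODE(os.stat(path).st_mode)
    os.chmod(path, mode & ~stat.S_ISUID)

def is_active(path):
    return (os.stat(path).st_mode & stat.S_ISUID) != 0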
as per Ken's suggestion, the setgid data directory would work in conjunction with some chmod'ing of the q entries. there's no need to muck with what programs get set{u,g}id to what as far as i can tell.
scott
On Fri, Oct 02, 1998 at 06:22:27PM -0400, Scott wrote:
| On Fri, Oct 02, 1998 at 03:55:30PM -0400, Scott wrote:
| | the mailman outgoing mail queue has a number of concurrency-control issues.
|
| [see previous post]
|
| | So we need a mail queuing architecture that will address all these
| | issues.
|
| Here's an idea:
|
| 1) we alter contact_transport so that it does not try to process the
| queue anymore. it would only deal with the delivery at hand.
|
| 2) we create a 2-part mailqueue inside
| mm_cfg.DATA_DIR/mqueue/{active,deferred}. when we enqueue a
| message for delivery, we put it in mqueue/active/<qfilename>. If
| the delivery succeeds, we unlink the file. If it fails, we rename
| the file to mqueue/deferred/<qfilename>. All mail queue files in
| active/ will be handled by a single process under the current
| delivery mechanism, so no concurrency control is necessary for
| active/ queue files. this would involve changes to TrySMPTDelivery,
| and the installation procedure.
|
| 3) we alter OutGoingQueue.enqueueMessage so that it can handle coming
| up with unique filenames under this 2-part mail queue mechanism.
|
| 4) we alter OutGoingQueue.processQueue so that it creates a site-wide
| queue_run lock file to prevent more than one queue run from
| happening at a time. this process will also check the active/
| queue files for files whose modification/creation time is older
| than some configurable amount of time (on the order of 1hr-1day).
| For each of these files, it will rename them to the deferred/ part
| of the queue before proceeding to process them. These 'stale'
| queue files would only come about as a result of system crashes or
| memory errors or similar serious system related and unpredictable
| errors that can happen in the middle of an smtp transaction.
|
| the above scheme works in theory only when run_queue uid is the owner
| of all the queue files and/or root. I believe that it is possible for
| queue files to be owned by both the uid of the cgi and the uid of the
| local mail delivery agent. If this is the case, then either run_queue
| will have to be run as root, or all processes creating a queue file
| will have to setuid mailman before creating the file. Are there any
| preferences on which of these two approaches would be best?
|
| the above scheme should not effect delivery rates much at all, since
| the TrySMTP process would be the same except that it would have to add
| a rename() operation if delivery failed. There would be no contention
| over locks for most deliveries. deliveries that are deferred would be
| handled in a sequential manner, but even that should be ok since each
| message in the queue can have up to some very large number of
| recipients. (on an unrelated note - has anyone bumped up against rcpt
| limits with mailers yet?)
|
| if there aren't any concerns over this approach, i'll go ahead and
| code it -- starting monday. should take a day or two to code and
| test.
|
| scott
|
|
|
|
|
| _______________________________________________
| Mailman-Developers maillist - Mailman-Developers@python.org
| http://www.python.org/mailman/listinfo/mailman-developers
|