[Mailman-Developers] (long) queue problems: an analysis

Scott scott@chronis.icgroup.com
Fri, 2 Oct 1998 15:55:30 -0400

the mailman outgoing mail queue has a number of concurrency-control issues. 

here's an overview of the processes involved in the queue, from what I
can tell so far:

1) before a delivery is attempted, the message is queued.  This is a
   good idea because if there is an unforeseen exception that kills
   the delivery process, we want the data already on disk so that it
   can be delivered at a later time.

2) each time a delivery is requested via the contact_transport script,
   the entire mail queue is rerun.

3) when there are subscribers that belong to more than one
   domain, there are potentially multiple contact_transport processes
   running concurrently per post as per the forking in the deliver
   script.

anytime there is more than one contact_transport script being run,
there is the possibility that one of those processes is in the middle
of delivering a message (that's already been queued and not yet
dequeued) while another one reads that queue entry, assumes that
since it's in the queue it needs to be delivered, and attempts the
delivery itself.  both processes then probably deliver the same
message (producing duplicates), and whichever of the two finishes the
delivery first unlinks the file.  the remaining one then fails to
unlink it, creating exceptions in logs/errors like this:

Oct 02 12:01:23 1998 contact_transport: Traceback (innermost last):
contact_transport:   File "/home/mailman/scripts/contact_transport", line 60, in ?
contact_transport:      OutgoingQueue.processQueue()
contact_transport:   File "/home/mailman/Mailman/OutgoingQueue.py", line 38, in processQueue
contact_transport:      Utils.TrySMTPDelivery(recip,sender,text,full_fname)
contact_transport:   File "/home/mailman/Mailman/Utils.py", line 226, in TrySMTPDelivery
contact_transport:      OutgoingQueue.dequeueMessage(queue_entry)
contact_transport:   File "/home/mailman/Mailman/OutgoingQueue.py", line 25, in dequeueMessage
contact_transport:      os.unlink(msg)
contact_transport: os . error :  (2, 'No such file or directory')

the same problem can occur when a run_queue process runs concurrently
with another run_queue process or a contact_transport process and
produces the same traceback from a different top level:

Oct 02 09:10:16 1998 smtplib: Traceback (innermost last):
smtplib:   File "/home/mailman/cron/run_queue", line 31, in ?
smtplib:      OutgoingQueue.processQueue()
smtplib:   File "/home/mailman/Mailman/OutgoingQueue.py", line 38, in processQueue
smtplib:      Utils.TrySMTPDelivery(recip,sender,text,full_fname)
smtplib:   File "/home/mailman/Mailman/Utils.py", line 226, in TrySMTPDelivery
smtplib:      OutgoingQueue.dequeueMessage(queue_entry)
smtplib:   File "/home/mailman/Mailman/OutgoingQueue.py", line 25, in dequeueMessage
smtplib:      os.unlink(msg)
smtplib: os . error :  (2, 'No such file or directory')
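
to make the race concrete, here is a rough sketch of one way a process
could claim a queue entry before delivering it, so that two processes
never work on the same file.  this is not what OutgoingQueue.py does
today; claim_entry, process_queue, and the ".claimed.<pid>" suffix are
names made up for illustration, and it is written for a current
Python rather than the one in the tracebacks above.  it relies on
rename being atomic within one filesystem on POSIX, so only one of two
racing processes can succeed.

```python
import os


def claim_entry(queue_dir, name, pid=None):
    """Try to claim a queue entry; return the claimed path, or None.

    Hypothetical helper: the ".claimed.<pid>" naming is an assumption
    made for this sketch, not an existing Mailman convention.
    """
    pid = os.getpid() if pid is None else pid
    src = os.path.join(queue_dir, name)
    dst = os.path.join(queue_dir, "%s.claimed.%d" % (name, pid))
    try:
        # Atomic on POSIX within one filesystem: exactly one of two
        # racing processes gets the file, the other sees ENOENT.
        os.rename(src, dst)
    except FileNotFoundError:
        return None  # another process claimed or removed it first
    return dst


def process_queue(queue_dir, deliver):
    """Deliver every unclaimed entry once; return the names delivered."""
    delivered = []
    for name in sorted(os.listdir(queue_dir)):
        if ".claimed." in name:
            continue  # some process is already working on this entry
        path = claim_entry(queue_dir, name)
        if path is None:
            continue  # lost the race; skip instead of raising
        with open(path) as f:
            text = f.read()
        deliver(text)    # stand-in for the real SMTP delivery call
        os.unlink(path)  # only the claiming process ever unlinks
        delivered.append(name)
    return delivered
```

note that in this sketch a process that dies (or whose deliver call
raises) after claiming leaves a ".claimed." file behind, so a real
design would also need some way to notice stale claims and requeue
them.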

In addition to this, there are permissions problems that can arise:
when a contact_transport script is called, it is called from the mail
process, and is set with group id mailman and whatever userid the
calling process hands to it.  it creates a queue file whose owner is
the uid of the process. later, when run_queue is run, it is run with
mailman's uid and consequently cannot do many things to the queue file,
like unlink it.  If the file isn't written with group read
permissions, run_queue can't read it either, and you get tracebacks like

Oct 02 12:50:03 1998 smtplib: Traceback (innermost last):
smtplib:   File "/home/mailman/cron/run_queue", line 31, in ?
smtplib:      OutgoingQueue.processQueue()
smtplib:   File "/home/mailman/Mailman/OutgoingQueue.py", line 34, in processQueue
smtplib:      f = open(full_fname,"r")
smtplib: IOError :  (13, 'Permission denied')
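
as an illustration of the read half of this problem, here is a small
sketch of writing a queue file so that the group can always read and
write it, regardless of the umask of whatever process calls
contact_transport.  write_queue_file and group_can_read are
hypothetical names, not existing Mailman functions, and the 0660 mode
is an assumption about what the mailman group needs; it is written for
a current Python on a Unix system.

```python
import os
import stat


def write_queue_file(path, text, mode=0o660):
    """Create a queue file readable and writable by owner and group.

    Hypothetical helper for illustration only.
    """
    # O_EXCL: fail rather than silently overwrite an existing entry.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, mode)
    with os.fdopen(fd, "w") as f:
        # The caller's umask may have stripped bits from `mode` at
        # creation time, so set the permissions explicitly.
        os.fchmod(f.fileno(), mode)
        f.write(text)


def group_can_read(path):
    """Return True if the file's mode grants read access to its group."""
    return bool(os.stat(path).st_mode & stat.S_IRGRP)
```

this only covers reading the file.  on POSIX systems, whether a
process may unlink a file is decided by its write permission on the
containing directory, so the queue directory itself would presumably
also need to be writable by the mailman group.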

So we need a mail queuing architecture that will address all these