b6, postfix/qrunner super disaster
Hi all. I know that one man's disaster is another man's chuckle at an incompetent amateur system administrator, but here goes.
I am running 2b6 under Mandrake 7.1 using postfix as an MTA. Last Thursday I posted a note to a small mailing list, but the note never showed up. I posted to the users list about that. First I addressed the locking problem and deleted the files in locks/. That did not solve the problem, so I then looked at the smtp log and saw that Mailman was trying to send the message but was getting a "host not found" error when trying to send the 20 copies of the message. I messed around with it, saw qrunner trying to resend the message every minute, figured that it must be a temporary DNS problem with my ISP, and left it alone. During this process I monitored the normal qrunner cron operations and also tried to push the queue manually by invoking the qrunner command line found in the cron file.
Then I left for four days in San Francisco.
When I got back I discovered that I had 20 new sworn enemies. Sunday morning, as if by magic, the mail actually got delivered, 1400 copies of it. Now I realize that I may have done something really stupid along the way, but I also think that it may be worthwhile to figure out what happened. I am wondering if qrunner got the error message and kept the item in qfiles, but postfix also deferred delivery of the message and kept it in the MTA mqueue -- growing by one copy a minute until the server was able to successfully find the recipients' hosts.
If anyone would like to do some forensics on this I would be happy to share log file data, both from Mailman and the regular mail log. Thanks in advance for thinking about this problem and what the cause of it may have been.
--chris
--
/////\\\\\/////\\\\
Christopher G. Kolar
Director, Department of Instructional Technology
Aurora University, Aurora, Illinois
ckolar@admin.aurora.edu -- www.aurora.edu/~ckolar
[PGP Public Key ID: 0xC6492C72]
"CK" == Christopher Kolar <ckolar@admin.aurora.edu> writes:
CK> I am wondering if qrunner got the error message and kept the
CK> item in qfiles, but postfix also deferred delivery of the
CK> message and kept it in the MTA mqueue -- growing by one copy a
CK> minute until the server was able to successfully find the
CK> recipients' hosts.
You're using SMTPDirect.py, right? Let's look at how deliver() works:
It tries to create an smtplib.SMTP instance, passing in the hostname and port that you've specified in mm_cfg.py (or inherited from Defaults.py).
This step could raise a socket.error or a general SMTPException. The assumption is that if that happens, the MTA never got the message and essentially delivery failed for all recipients.
Next, the SMTP.sendmail() method is called to send the message text to the list of recipients. One of two things could happen here:
a. an SMTPRecipientsRefused is raised, meaning that the MTA refused every recipient. The exception object has an attribute which contains the failing recipients. The assumption here is that delivery failed to those recipients.
b. the sendmail() method could return a mapping of failed recipients similar to (a) above, meaning that some but not all of the recipients had delivery problems.
Each failed recipient has a corresponding error code describing why that recipient failed. Each failed recipient is processed in turn:
a. If the error code is >= 500 but <> 552, then the failure is deemed permanent according to RFC 821 and DRUMS. That address is RegisterBounce()'d and discarded.
b. Otherwise the failure is deemed temporary, so Mailman remembers the address for retry.
If there are any retryable addresses, the message remains in the qfiles queue and is retried for the recipients that had temporary failures.
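
As a rough illustration of the steps above, here is a minimal sketch in present-day Python. It is not the actual SMTPDirect.py code: the names classify_failures() and deliver(), the smtp_factory parameter, and the choice to treat a failed connection as retryable for every recipient are all assumptions made for the example. Only smtplib.SMTP, its sendmail() method, and SMTPRecipientsRefused come from the standard library.

```python
import smtplib


def classify_failures(refused):
    """Split a {address: (code, message)} mapping into two sorted lists.

    Codes >= 500 other than 552 are treated as permanent; everything
    else is treated as temporary and therefore retryable.
    """
    permanent, temporary = [], []
    for addr, (code, _msg) in refused.items():
        if code >= 500 and code != 552:
            permanent.append(addr)
        else:
            temporary.append(addr)
    return sorted(permanent), sorted(temporary)


def deliver(host, port, sender, recipients, msgtext,
            smtp_factory=smtplib.SMTP):
    """Make one delivery attempt; return (permanent, temporary) failures.

    smtp_factory is a hypothetical hook so the sketch can be exercised
    without a real MTA.
    """
    try:
        conn = smtp_factory(host, port)
    except (OSError, smtplib.SMTPException):
        # Assumption: the MTA never saw the message, so every
        # recipient is kept for a later retry.
        return [], sorted(recipients)
    try:
        try:
            # Returns a mapping of the recipients the MTA refused.
            refused = conn.sendmail(sender, recipients, msgtext)
        except smtplib.SMTPRecipientsRefused as exc:
            # The exception carries the same kind of mapping.
            refused = exc.recipients
    finally:
        conn.quit()
    return classify_failures(refused)
```

Anything in the second list stays queued and is tried again on the next qrunner pass. If the MTA reports a temporary failure for a recipient but also keeps the message in its own queue, each retry hands the MTA another copy.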
So, the only thing I can guess at is that Postfix is returning a temporary failure code for recipients to which it nevertheless still attempts delivery. Simon Coggins reports similar symptoms with sendmail, but I've never seen them, and I suspect that the situation causing these must be pretty rare.
So that's the idea behind SMTPDirect.py, but I still don't know enough to understand what's causing the dups. Could it be some misunderstanding of the RFC 821 error codes?
-Barry