--On 3. Februar 2018 um 19:13:33 -0800 Mark Sapiro mark@msapiro.net wrote:
On 02/03/2018 01:03 AM, Sebastian Hagedorn wrote:
Did you look at the out queue, and if so, was there a .bak file there? That would be the entry currently being processed.
I looked at the out queue, and there was no .bak file.
Interesting. That says that OutgoingRunner is not currently delivering a message, but that is inconsistent with this:
Also, the TCP connection to the MTA being ESTABLISHED says the OutgoingRunner has called SMTPDirect.process() and it in turn is somewhere in its delivery loop of sending SMTP transactions.
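For context on why the connection stays ESTABLISHED for a while: the delivery loop sends one SMTP transaction per batch of recipients over a single open connection. A rough sketch of the batching idea (the helper name and batch size here are illustrative, not Mailman's actual code; the real batch size comes from the SMTP_MAX_RCPTS setting):

```python
def chunkify(recipients, max_rcpts=500):
    """Split a recipient list into batches, one SMTP transaction each.

    Illustrative stand-in for Mailman's recipient chunking.
    """
    return [recipients[i:i + max_rcpts]
            for i in range(0, len(recipients), max_rcpts)]

# Each batch becomes one MAIL FROM / RCPT TO / DATA transaction on the
# same connection, so the TCP session stays ESTABLISHED until the last
# batch has been sent.
batches = chunkify(["a@example.com", "b@example.com", "c@example.com"],
                   max_rcpts=2)
```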
Is it possible that the OutgoingRunner was done with transmitting the message and had already removed the queue file, but that the connection hadn't yet been closed?
Are there any clues in the MTA logs?
I just found this in Mailman's smtp-failures log:
Feb 01 14:28:49 2018 (1674) Low level smtp error: [Errno 111] Connection refused, msgid: B51BA08829F27146A07699F58B941234A27397BF@EX10DAG2.intern.xxx
Feb 01 14:28:49 2018 (1674) delivery to xxx@uni-koeln.de failed with code -1: [Errno 111] Connection refused
Normally, that won't cause a problem like this. This occurs at a fairly low level in SMTPDirect.py when Mailman is initiating a transaction with the MTA to send to one or more recipients. The recipients will be marked as "refused retryably", and OutgoingRunner will queue the message for those recipients in the retry queue to be retried.
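Schematically, that error handling looks something like the following (a simplified sketch, not Mailman's exact code; `send_fn` is a hypothetical callable standing in for the actual smtplib sendmail call, and the -1 code matches the log lines above):

```python
import socket

def deliver(send_fn, recipients):
    """Attempt one SMTP transaction; on a low-level socket error,
    mark every recipient in the batch as refused retryably."""
    refused = {}
    try:
        send_fn(recipients)
    except socket.error as e:
        for rcpt in recipients:
            # code -1 signals a retryable failure; the message is
            # requeued for these recipients in the retry queue
            refused[rcpt] = (-1, str(e))
    return refused

def broken_mta(recipients):
    # Simulate the MTA refusing the connection outright
    raise socket.error(111, "Connection refused")

failures = deliver(broken_mta, ["xxx@uni-koeln.de"])
```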
You can set SMTPLIB_DEBUG_LEVEL = 1 in mm_cfg.py to log copious smtplib debugging info to Mailman's error log. Then the log will show the last thing that was done before the hang.
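For reference, the setting goes into mm_cfg.py like any other override (and Mailman needs a restart to pick it up):

```python
# mm_cfg.py -- any value > 0 makes smtplib echo the full SMTP
# dialogue, which Mailman writes to its error log.
SMTPLIB_DEBUG_LEVEL = 1
```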
The problem with that is that we run 3,200 lists on that server. Not all of them are high-volume, but I'm worried that our log files would explode. I just checked, and yesterday we sent mail to 50,000 recipients.
If this should happen again, what should we look for? Would a gdb backtrace be helpful?
It might be, if you can pinpoint just where in the code it's hung. Also, I didn't look carefully before, but in your OP you show:
mailman 1663 0.0 0.0 233860 2204 ? Ss Jan16 0:00 /usr/bin/python2.7 /usr/lib/mailman/bin/mailmanctl -s -q start
mailman 1677 0.1 0.9 295064 73284 ? S Jan16 35:35 /usr/bin/python2.7 /usr/lib/mailman/bin/qrunner --runner=OutgoingRunner:3:4 -s
The status of 'S' for OutgoingRunner is interruptible sleep. This means it has either called time.sleep() for QRUNNER_SLEEP_TIME (default: 1 second), which is unlikely as it should wake up from that, or it's waiting for a response from something, most likely the MTA.
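For reference, the ps(1) STAT codes, as a small lookup based on the ps man page (note it is 'D', not 'S', that means uninterruptible sleep):

```python
# Process state codes as reported in the ps STAT column (see ps(1)).
PS_STATES = {
    "R": "running or runnable",
    "S": "interruptible sleep (waiting for an event)",
    "D": "uninterruptible sleep (usually I/O)",
    "Z": "zombie (terminated, not reaped)",
    "T": "stopped (job control or traced)",
}

# The qrunner above shows 'S': asleep until an event arrives -- here,
# most likely data on the socket to the MTA that never comes.
state = PS_STATES["S"]
```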
Now I wonder whether the MTA had already closed the connection and Mailman for some reason didn't notice, because the Runner was stuck longer than any timeout on the MTA would permit. But I failed to check that. Should it happen again, I will have a look at the MTA end.
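One mitigation along those lines is a socket-level timeout, so that a silently dead peer raises an exception instead of leaving the runner blocked in recv() forever; smtplib.SMTP accepts a timeout argument for exactly this. The underlying behaviour can be sketched with plain sockets (the never-answering listener is just a stand-in for an MTA that went away):

```python
import socket

# A listener that accepts connections but never sends a byte --
# a stand-in for an MTA that has silently stopped responding.
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))
srv.listen(1)

# With a timeout set, a blocked read raises socket.timeout instead of
# keeping the process in sleep state indefinitely.
client = socket.create_connection(srv.getsockname(), timeout=1.0)
try:
    client.recv(1024)      # would block forever without the timeout
    result = "got data"
except socket.timeout:
    result = "timed out"
finally:
    client.close()
    srv.close()
```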
.:.Sebastian Hagedorn - Weyertal 121 (Gebäude 133), Zimmer 2.02.:.
.:.Regionales Rechenzentrum (RRZK).:.
.:.Universität zu Köln / Cologne University - ✆ +49-221-470-89578.:.