It happened again yesterday. Details below.
--On 7 February 2018 at 12:43:18 +0900 Yasuhito FUTATSUKI futatuki@poem.co.jp wrote:
In fact,
On 02/02/18 19:26, Sebastian Hagedorn wrote:
[root@mailman3/usr/lib/mailman/bin]$ strace -p 1677
Process 1677 attached
recvfrom(10, ^CProcess 1677 detached
indicates that OutgoingRunner process 1677 was still in the recvfrom(2) system call (perhaps called via recv(2)) on FD 10, and
[root@mailman3/usr/lib/mailman/bin]$ lsof -p 1677
COMMAND    PID USER    FD   TYPE DEVICE   SIZE/OFF NODE   NAME
python2.7 1677 mailman cwd  DIR  253,0    4096     173998 /usr/lib/mailman
python2.7 1677 mailman rtd  DIR  253,0    4096     2      /
...
python2.7 1677 mailman 10u  IPv6 46441320 0t0      TCP    mailman3.rrz.uni-koeln.de:55764->smtp-out.rrz.uni-koeln.de:smtp (ESTABLISHED)
indicates that its FD 10 was an ESTABLISHED connection to the MTA.
That situation was exactly the same. This time we confirmed on the MTA that there was no trace of that connection anymore. At the time of the incident, the MTA was once again under high load and delaying commands. That definitely seems to be a contributing factor. We didn't find any evidence of a connection that was dropped by the MTA, but with four OutgoingRunners we didn't find a way to determine which transaction related to which runner.
If the MTA hangs (or makes only very slow progress) at the application layer while keeping the TCP connection alive at the lower layer, a client that uses smtplib without specifying a timeout, like the current SMTPDirect handler in Mailman, must wait until the MTA either responds or dies.
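To illustrate the point about timeouts, here is a minimal sketch, not Mailman code. It assumes a hypothetical local "silent" server that accepts the TCP connection but never sends the SMTP greeting, standing in for an MTA that is stalled at the application layer. The helper names (start_silent_server, connect_with_timeout) are made up for this example. With a timeout passed to smtplib.SMTP, the client gives up instead of blocking in recv indefinitely.

```python
import smtplib
import socket
import threading


def start_silent_server():
    """Listen on a free local port, accept one connection, never send a greeting."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))
    srv.listen(1)
    held = []  # keep the accepted socket referenced so the connection stays open

    def accept_once():
        conn, _ = srv.accept()
        held.append(conn)

    t = threading.Thread(target=accept_once)
    t.daemon = True
    t.start()
    return srv, held


def connect_with_timeout(port, timeout):
    """Return 'timed out' if the server stays silent, else 'connected'."""
    try:
        # The constructor connects and then waits for the 220 greeting.
        conn = smtplib.SMTP("127.0.0.1", port, timeout=timeout)
    except (smtplib.SMTPServerDisconnected, socket.timeout):
        # Depending on the Python version, the read timeout surfaces either
        # directly or wrapped in SMTPServerDisconnected.
        return "timed out"
    conn.close()
    return "connected"


srv, held = start_silent_server()
result = connect_with_timeout(srv.getsockname()[1], timeout=1.0)
for c in held:
    c.close()
srv.close()
print(result)
```

Without the timeout argument, the same call would simply sit in the read for as long as the server keeps the connection open, which matches what strace showed. Whether and how the SMTPDirect handler could be given such a timeout is a separate question that this sketch does not address.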
If I understood Mark correctly, the MTA dropping the connection should have raised socket.error regardless of timeouts. The question is why it didn't. I suppose that could be either a bug in our version of the Python libraries or in the OS. Any ideas how we should proceed to determine the root cause?
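For comparison, the behaviour expected on a drop can be reproduced locally. This is only a sketch under assumptions: the hypothetical server below closes the connection cleanly right after accepting it, so the client's kernel receives a FIN. The helper names (start_dropping_server, connect_without_timeout) are invented for the example. In that case a client without any timeout does get an error.

```python
import smtplib
import socket
import threading


def start_dropping_server():
    """Listen on a free local port, accept one connection and close it at once."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))
    srv.listen(1)

    def accept_and_drop():
        conn, _ = srv.accept()
        conn.close()  # clean close: the client receives a FIN

    t = threading.Thread(target=accept_and_drop)
    t.daemon = True
    t.start()
    return srv


def connect_without_timeout(port):
    """Return 'disconnected' if the drop is reported to the client, else 'connected'."""
    try:
        conn = smtplib.SMTP("127.0.0.1", port)  # deliberately no timeout argument
    except (smtplib.SMTPServerDisconnected, socket.error):
        return "disconnected"
    conn.close()
    return "connected"


drop_srv = start_dropping_server()
drop_result = connect_without_timeout(drop_srv.getsockname()[1])
drop_srv.close()
print(drop_result)
```

This only covers the case where the close actually reaches the client. If the FIN or RST never arrives, for example because it is lost or filtered on the way, the client's socket stays ESTABLISHED and a read without a timeout keeps blocking, which would look the same as the strace and lsof output above.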
.:.Sebastian Hagedorn - Weyertal 121 (Gebäude 133), Zimmer 2.02.:.
.:.Regionales Rechenzentrum (RRZK).:.
.:.Universität zu Köln / Cologne University - ✆ +49-221-470-89578.:.