On 02/02/2018 02:26 AM, Sebastian Hagedorn wrote:
> Hi,
> we've been running Mailman for many years and have never had stability
> issues, but about a month ago we moved the server from RHEL 5 to RHEL 6
> and upgraded to the current version (2.1.25). Since then, one of our four
> OutgoingRunners has twice got "stuck" and stopped handling mail. When
> that happens, a simple restart of the service does not work. These
> processes remained:
> mailman   1663  0.0  0.0 233860  2204 ?        Ss   Jan16   0:00 /usr/bin/python2.7 /usr/lib/mailman/bin/mailmanctl -s -q start
> mailman   1677  0.1  0.9 295064 73284 ?        S    Jan16  35:35 /usr/bin/python2.7 /usr/lib/mailman/bin/qrunner --runner=OutgoingRunner:3:4 -s

That's because pid 1677 didn't respond to the SIGINT from the master,
and the master is still waiting for it to exit.
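To illustrate how a runner can ignore the master's SIGINT, here is a simplified model. It is not Mailman's actual qrunner code: the class `Runner`, its methods, and the `deliver` callback are hypothetical, and it is written for Python 3 rather than the 2.7 Mailman runs on. The signal handler only sets a flag, and the loop checks that flag between queue entries, so if one delivery blocks indefinitely (as in the recvfrom below), the runner never gets back to the check and never exits.

```python
import signal


class Runner:
    """Hypothetical, simplified queue runner (not Mailman's real class).

    A stop request only sets a flag; the flag is checked between entries,
    so a delivery that never returns keeps the runner alive.
    """

    def __init__(self, entries, deliver):
        self._entries = list(entries)
        self._deliver = deliver      # hypothetical callback doing the SMTP work
        self._stop = False
        self.processed = []

    def handle_sigint(self, signum, frame):
        # Only records the request; it does not abort a delivery in progress.
        self._stop = True

    def install_handler(self):
        # signal.signal() must be called from the main thread.
        signal.signal(signal.SIGINT, self.handle_sigint)

    def run(self):
        while self._entries and not self._stop:
            entry = self._entries.pop(0)
            self._deliver(entry)     # a call blocked here never returns to the check above
            self.processed.append(entry)
        return self.processed


if __name__ == "__main__":
    def deliver(entry):
        if entry == "b":             # simulate SIGINT arriving mid-delivery
            runner.handle_sigint(signal.SIGINT, None)

    runner = Runner(["a", "b", "c"], deliver)
    print(runner.run())              # ['a', 'b']: "c" is never started
```

The stop request is honoured only once the current delivery returns; if `deliver` never returns, neither does `run`, and a parent waiting on the process waits forever.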

> [root@mailman3/usr/lib/mailman/bin]$ strace -p 1677
> Process 1677 attached
> recvfrom(10, ^CProcess 1677 detached
> [root@mailman3/usr/lib/mailman/bin]$ lsof -p 1677
> python2.7 1677 mailman  cwd    DIR    253,0     4096 173998 /usr/lib/mailman
> python2.7 1677 mailman  rtd    DIR    253,0     4096      2 /
> ...
> python2.7 1677 mailman   10u  IPv6 46441320      0t0    TCP mailman3.rrz.uni-koeln.de:55764->smtp-out.rrz.uni-koeln.de:smtp
> In both instances the OutgoingRunner was stuck on an SMTP connection. I
> had to use "kill -9" to get rid of it.
> Any ideas what might be causing that?

I think I've seen this once or maybe twice, but I don't recall the
details. I wasn't able to determine a cause, and I haven't seen it in years.

Did you look at the out queue, and if so, was there a .bak file there?
This would be the entry currently being processed.
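One quick way to check is to list the `.bak` files in the out queue together with their age; an old one is a candidate for the entry a runner is stuck on. This is a minimal sketch: the function name `find_bak_entries` is hypothetical, and the queue directory is passed in because its location depends on how Mailman was installed. Age is measured from each file's modification time.

```python
import os
import time


def find_bak_entries(queue_dir, now=None):
    """List .bak files in queue_dir as (name, age_seconds), oldest first.

    Age is measured from each file's modification time.
    """
    now = time.time() if now is None else now
    found = []
    for name in os.listdir(queue_dir):
        if name.endswith(".bak"):
            mtime = os.path.getmtime(os.path.join(queue_dir, name))
            found.append((name, now - mtime))
    # Largest age first, i.e. the file that has been sitting longest.
    found.sort(key=lambda item: item[1], reverse=True)
    return found
```

Usage would be something like `find_bak_entries("/path/to/qfiles/out")`, where the path is a placeholder for your installation's out queue.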

Also, the fact that the TCP connection to the MTA is ESTABLISHED says
the OutgoingRunner has called SMTPDirect.process(), which in turn is
somewhere in its delivery loop, sending SMTP transactions.
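If the MTA keeps the connection open but never answers, a blocking read like the recvfrom(10, ...) in the strace output can wait indefinitely. One general mitigation, offered as an assumption rather than something Mailman 2.1.25 is known to do, is a socket timeout; Python's `smtplib.SMTP` accepts a `timeout` argument for exactly this. The sketch below shows the mechanism on a plain socket; `read_reply` is a hypothetical helper.

```python
import socket


def read_reply(sock, timeout):
    """Read one chunk of a server reply, or return None if nothing arrives in time.

    Without settimeout() the recv() below would block indefinitely,
    much like the stuck recvfrom() seen in strace.
    """
    sock.settimeout(timeout)
    try:
        return sock.recv(4096)
    except socket.timeout:
        return None


if __name__ == "__main__":
    client, server = socket.socketpair()
    print(read_reply(client, 0.2))   # None: the "server" never answers
    server.sendall(b"250 OK\r\n")
    print(read_reply(client, 0.2))   # b'250 OK\r\n'
    client.close()
    server.close()
```

With smtplib the equivalent is `smtplib.SMTP(host, port, timeout=...)`, which makes a silent server raise a timeout error instead of hanging the caller.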

Are there any clues in the MTA logs?

