
Hi,
Yesterday we had a major network outage that lasted 2+ hours. This morning I found that the mailman server had "stopped working". I have restarted mailman, and now the server, but mailman is not processing any emails. The load in top is around 0.00 to 0.01, with no disk I/O and no activity at all; the disks are not full. ssh takes a long time to log in, but there is no load to cause this. I can see python running in top, but at about 0.3%.
I ran mail me@work and postfix sent it fast, so postfix seems OK. It's like Mailman is in a coma: it has choked to death but its heart is still beating... it just seems odd.
What can I do here?
regards
Steven

On 10/17/19 1:41 PM, Steven Jones wrote:
Hi,
Yesterday we had a major network outage that lasted 2+ hours. This morning I found that the mailman server had "stopped working". I have restarted mailman, and now the server, but mailman is not processing any emails. The load in top is around 0.00 to 0.01, with no disk I/O and no activity at all; the disks are not full. ssh takes a long time to log in, but there is no load to cause this. I can see python running in top, but at about 0.3%.
You say you restarted Mailman and the server, but is Mailman actually running?
Is Postfix delivering to Mailman (look at Postfix's logs)? Is stuff piling up in Mailman's queues? If so, which one(s)?
See the FAQ at https://wiki.list.org/x/4030723 focusing on items 2.2, 6.2, 7, 8 and 9.
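One way to answer the "which queue(s)" question is to count the entries waiting in each queue directory. The sketch below assumes a Mailman 2.1 layout in which each queue is a subdirectory holding `.pck` files. The default path `/var/spool/mailman` is an assumption (typical of Red Hat packages, other installs may differ), and `count_queue_entries` is a hypothetical helper name, not part of Mailman.

```python
import os


def count_queue_entries(qfiles_dir="/var/spool/mailman"):
    """Return {queue_name: number_of_pending_entries} for each queue directory.

    Assumption: every subdirectory of qfiles_dir is a queue, and each pending
    entry is stored as a file ending in ".pck". Other files are ignored.
    """
    counts = {}
    for name in sorted(os.listdir(qfiles_dir)):
        path = os.path.join(qfiles_dir, name)
        if os.path.isdir(path):
            counts[name] = sum(
                1 for entry in os.listdir(path) if entry.endswith(".pck")
            )
    return counts
```

Calling `count_queue_entries()` as a user who can read the queue directories and printing the result shows at a glance whether, for example, `in` or `out` is growing.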

thanks
I am sort of suspecting postfix as /var/log/maillog is not being written to.
meanwhile,
/var/log/mailman/smtp has something odd. The time to send email varies between 0.017 and 180 seconds plus, and it's getting worse:
==========
Oct 18 11:26:13 2019 (9398) mailman.0.1571274477.9729.postgradcoordinators@lists.vuw.ac.nz smtp to postgradcoordinators for 1 recips, completed in 0.017 seconds
Oct 18 11:26:26 2019 (9398) mailman.1.1571274477.9729.postgradcoordinators@lists.vuw.ac.nz smtp to postgradcoordinators for 14 recips, completed in 12.419 seconds
Oct 18 11:26:27 2019 (9398) mailman.0.1571274594.9753.vuw-schooladmin@lists.vuw.ac.nz smtp to vuw-schooladmin for 14 recips, completed in 0.029 seconds
Oct 18 11:26:27 2019 (9398) n/a smtp to its-alerts for 10 recips, completed in 0.073 seconds
Oct 18 11:26:27 2019 (9398) mailman.0.1571274783.10156.postgradcoordinators@lists.vuw.ac.nz smtp to postgradcoordinators for 1 recips, completed in 0.019 seconds
Oct 18 11:26:27 2019 (9398) mailman.1.1571274783.10156.postgradcoordinators@lists.vuw.ac.nz smtp to postgradcoordinators for 14 recips, completed in 0.019 seconds
Oct 18 11:26:39 2019 (9398) n/a smtp to its-alerts for 10 recips, completed in 10.111 seconds
Oct 18 11:26:41 2019 (9398) ME2PR01MB4292894A699E1DF46C8DABE8C46D0@ME2PR01MB4292.ausprd01.prod.outlook.com smtp to fad-design-postgrads for 102 recips, completed in 0.248 seconds
Oct 18 11:26:41 2019 (9398) n/a smtp to its-alerts for 10 recips, completed in 0.024 seconds
Oct 18 11:26:41 2019 (9398) ME2PR01MB4292894A699E1DF46C8DABE8C46D0@ME2PR01MB4292.ausprd01.prod.outlook.com smtp to fad-design for 56 recips, completed in 0.030 seconds
Oct 18 11:29:45 2019 (9398) SYAPR01MB26875792826537251B8585DDD46D0@SYAPR01MB2687.ausprd01.prod.outlook.com smtp to nz-libs for 1844 recips, completed in 184.400 seconds
Oct 18 11:31:31 2019 (9398) mailman.10788.1571276386.12225.teachingandlearning@lists.vuw.ac.nz smtp to teachingandlearning for 1 recips, completed in 103.629 seconds
Oct 18 11:34:22 2019 (9398) n/a smtp to its-alerts for 10 recips, completed in 169.826 seconds
Oct 18 11:36:26 2019 (9398) n/a smtp to its-alerts for 10 recips, completed in 123.446 seconds
Oct 18 11:39:33 2019 (9398) ME2PR01MB429218FD19F84FE6C16AA291C46D0@ME2PR01MB4292.ausprd01.prod.outlook.com smtp to fad-design for 56 recips, completed in 187.700 seconds
8><----
==========
We are getting mail, but very slowly, so mailman is working. For item 2.2:
[root@vuwunicomailmn1 ~]# ps auxww| grep mailmanctl |grep -v grep
mailman 9390 0.0 0.1 214172 9432 ? Ss 09:32 0:00 /usr/bin/python /usr/lib/mailman/bin/mailmanctl -s -q start
and,
[root@vuwunicomailmn1 ~]# service mailman status
mailman (pid 9390) is running...
[root@vuwunicomailmn1 ~]# service postfix status
master (pid 7688) is running...
[root@vuwunicomailmn1 ~]#
seems to confirm this.
[root@vuwunicomailmn1 ~]# rpm -q mailman
mailman-2.1.12-26.el6_9.3.x86_64
[root@vuwunicomailmn1 ~]# ps auxww | egrep 'p[y]thon'
root 8152 0.0 0.5 598144 46040 ? Sl 09:24 0:05 /usr/bin/python2.7 /usr/bin/salt-minion -c /etc/salt -d
mailman 9390 0.0 0.1 214172 9432 ? Ss 09:32 0:00 /usr/bin/python /usr/lib/mailman/bin/mailmanctl -s -q start
mailman 9393 2.7 2.8 435468 231896 ? S 09:32 3:36 /usr/bin/python /usr/lib/mailman/bin/qrunner --runner=ArchRunner:0:1 -s
mailman 9394 0.0 0.2 219840 16204 ? S 09:32 0:01 /usr/bin/python /usr/lib/mailman/bin/qrunner --runner=BounceRunner:0:1 -s
mailman 9395 0.0 0.1 216084 12204 ? S 09:32 0:00 /usr/bin/python /usr/lib/mailman/bin/qrunner --runner=CommandRunner:0:1 -s
mailman 9396 0.0 0.2 226128 22572 ? S 09:32 0:04 /usr/bin/python /usr/lib/mailman/bin/qrunner --runner=IncomingRunner:0:1 -s
mailman 9397 0.0 0.1 216164 12300 ? S 09:32 0:00 /usr/bin/python /usr/lib/mailman/bin/qrunner --runner=NewsRunner:0:1 -s
mailman 9398 0.1 0.2 221688 18224 ? S 09:32 0:12 /usr/bin/python /usr/lib/mailman/bin/qrunner --runner=OutgoingRunner:0:1 -s
mailman 9399 0.0 0.2 221168 17556 ? S 09:32 0:03 /usr/bin/python /usr/lib/mailman/bin/qrunner --runner=VirginRunner:0:1 -s
mailman 9400 0.0 0.1 216068 12200 ? S 09:32 0:00 /usr/bin/python /usr/lib/mailman/bin/qrunner --runner=RetryRunner:0:1 -s
root 20071 0.0 0.2 210860 19864 ? S 11:21 0:00 /usr/bin/python /usr/libexec/rhsmd
[root@vuwunicomailmn1 ~]#
qrunner,
=========
[root@vuwunicomailmn1 mailman]# tail qrunner
Oct 18 09:32:13 2019 (9400) RetryRunner qrunner started.
Oct 18 09:32:13 2019 (9395) CommandRunner qrunner started.
Oct 18 09:32:13 2019 (9396) IncomingRunner qrunner started.
Oct 18 09:32:13 2019 (9399) VirginRunner qrunner started.
Oct 18 09:32:13 2019 (9397) NewsRunner qrunner started.
Oct 18 09:32:13 2019 (9394) BounceRunner qrunner started.
Oct 18 09:32:13 2019 (9398) OutgoingRunner qrunner started.
Oct 18 09:32:13 2019 (8205) OutgoingRunner qrunner exiting.
Oct 18 09:32:13 2019 (8172) Master qrunner detected subprocess exit (pid: 8205, sig: None, sts: 15, class: OutgoingRunner, slice: 1/1)
[root@vuwunicomailmn1 mailman]#
=========
and it's now 11:50am.
Ok to assume the above in the qrunner log is normal?
regards
Steven
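The delivery times in /var/log/mailman/smtp can be pulled out mechanically rather than read by eye. The following is a minimal sketch that assumes log lines of the form shown in this message ("smtp to LIST for N recips, completed in S seconds"). The function name `slow_deliveries` and the 10-second threshold are illustrative choices, not anything Mailman provides.

```python
import re

# Matches the tail of a Mailman smtp log entry as it appears in this thread.
SMTP_LINE = re.compile(
    r"smtp to (?P<list>\S+) for (?P<recips>\d+) recips, "
    r"completed in (?P<seconds>[\d.]+) seconds"
)


def slow_deliveries(lines, threshold=10.0):
    """Return (list_name, recipients, seconds) for deliveries at or above threshold."""
    slow = []
    for line in lines:
        match = SMTP_LINE.search(line)
        if match is None:
            continue  # not a delivery-completed line
        seconds = float(match.group("seconds"))
        if seconds >= threshold:
            slow.append((match.group("list"), int(match.group("recips")), seconds))
    return slow
```

Feeding it the lines of the smtp log (for example `slow_deliveries(open("/var/log/mailman/smtp"))`) makes it easy to see that even 1-recipient and 10-recipient deliveries are taking minutes, which points away from list size as the cause.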

On 10/17/19 3:59 PM, Steven Jones wrote:
thanks
I am sort of suspecting postfix as /var/log/maillog is not being written to.
meanwhile,
/var/log/mailman/smtp has something odd. The time to send email varies between 0.017 and 180 seconds plus, and its getting worse,
8><----
Mailman's out queue is backlogged. See https://wiki.list.org/x/17892002. Something has changed that affects Postfix. Possibly you had a local DNS cache that is no longer working, so Postfix DNS lookups are taking a long time. Also consider a separate Postfix port for Mailman delivery with minimal checking. On mail.python.org we use something like this in master.cf:
127.0.0.1:8027 inet n - - - - smtpd
  -o smtpd_recipient_restrictions=permit_mynetworks,reject
  -o smtpd_client_restrictions=
  -o smtpd_helo_restrictions=
  -o smtpd_sender_restrictions=
  -o smtpd_data_restrictions=
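For Mailman to use such a port, it also has to be told to deliver there. In Mailman 2.1 this is normally done with the `SMTPHOST` and `SMTPPORT` settings in `mm_cfg.py`. The fragment below is a sketch that assumes the 127.0.0.1:8027 listener from the master.cf entry and the stock SMTPDirect delivery module; adjust the values to the local setup.

```python
# mm_cfg.py fragment (goes after the existing "from Defaults import *" line).
# Assumption: Postfix listens on 127.0.0.1:8027 as in the master.cf entry.
SMTPHOST = '127.0.0.1'
SMTPPORT = 8027
```

Mailman needs a restart to pick up changes to `mm_cfg.py`.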
...
Ok to assume the above in the qrunner log is normal?
The last three lines are not normal in the sense that an OutgoingRunner has died, but they appear to be the result of a Mailman restart: the old OutgoingRunner persisted for a while because it had to finish delivering its current message. Note the out-of-sequence PID and the fact that ps shows 9398 is running.

- Steven Jones steven.jones@vuw.ac.nz:
thanks
I am sort of suspecting postfix as /var/log/maillog is not being written to.
That's bad. So try stopping and starting postfix (after stopping check with "ps auxwww|fgrep post" to see if there's anything left!!!)
and have a look at /var/log/maillog while doing this.
Ralf Hildebrandt Charité - Universitätsmedizin Berlin Geschäftsbereich IT | Abteilung Netzwerk
Campus Benjamin Franklin (CBF) Haus I | 1. OG | Raum 105 Hindenburgdamm 30 | D-12203 Berlin
Tel. +49 30 450 570 155 ralf.hildebrandt@charite.de https://www.charite.de

On 10/18/2019 12:42 AM, Ralf Hildebrandt wrote:
I am sort of suspecting postfix as /var/log/maillog is not being written to.
That's bad. So try stopping and starting postfix (after stopping check with "ps auxwww|fgrep post" to see if there's anything left!!!)
and have a look at /var/log/maillog while doing this.
While I don't object at all to the conversation, this is venturing outside of mailman and into system management (DNS function, are the file systems full, which processes are running, etc).
Later,
z!

Hi,
Thanks, I traced this to using TCP connections to the remote logging server (at the insistence of the security manager) rather than UDP. rsyslog on the mailman server locked up because the remote server's rsyslog had locked up due to overload. Once I restarted the remote rsyslog daemon, the queue of 1800 mails disappeared so fast I didn't have time to see it go.
Thanks all.
regards
Steven
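A stalled local syslog daemon can plausibly make every program that logs through it (sshd, Postfix) wait on its log writes, which would fit the slow logins and the silent /var/log/maillog seen in this thread. The sketch below is one way to probe for that condition: it times a single syslog write and gives up after a timeout. The names `timed_call` and `probe_syslog` and the 5-second timeout are hypothetical, chosen only for illustration.

```python
import threading
import time


def timed_call(func, timeout=5.0):
    """Run func in a daemon thread.

    Returns the elapsed seconds if func finished within timeout, or None if
    it was still running (or raised) when the timeout expired.
    """
    done = threading.Event()

    def worker():
        func()
        done.set()

    start = time.monotonic()
    threading.Thread(target=worker, daemon=True).start()
    finished = done.wait(timeout)
    return time.monotonic() - start if finished else None


def probe_syslog(timeout=5.0):
    """Time one write to the local syslog; None suggests the write is blocked."""
    import syslog  # Unix-only standard library module

    return timed_call(
        lambda: syslog.syslog(syslog.LOG_INFO, "syslog probe"), timeout
    )
```

On a healthy system `probe_syslog()` should return a very small number; a result of `None` would be a hint to look at the logging path before blaming Mailman or Postfix.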

On 17 Oct 2019, at 16:41, Steven Jones wrote:
ssh takes a long time to login but there is no load to cause this.
The first thing to suspect when logins take a long time with no tangible loading issues (i/o, memory, and CPU all good) is DNS. If your resolver is trying to query a non-responsive DNS server or servers but eventually hits one that works, login and anything involving email transport will be plagued by delays.
So: check /etc/resolv.conf and make sure all listed servers are responding to queries.
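As a rough illustration of that check, the sketch below lists the nameservers configured in a resolv.conf file and times one lookup through the system resolver. It does not query each server individually (that would need a tool such as dig or a DNS library), and the helper names `parse_nameservers` and `time_lookup` are hypothetical.

```python
import socket
import time


def parse_nameservers(resolv_conf_text):
    """Return the addresses from 'nameserver' lines, in file order."""
    servers = []
    for line in resolv_conf_text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", ";")):
            continue  # blank line or comment
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


def time_lookup(hostname):
    """Return the seconds the system resolver took, whether or not it succeeded."""
    start = time.monotonic()
    try:
        socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        pass  # a failed lookup still tells us how long the resolver waited
    return time.monotonic() - start
```

For example, `parse_nameservers(open("/etc/resolv.conf").read())` shows which servers are in play, and a `time_lookup(...)` result of several seconds for a name that should resolve instantly suggests the first listed server is not answering.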
participants (5): Bill Cole, Carl Zwanzig, Mark Sapiro, Ralf Hildebrandt, Steven Jones