Hello,
I have mailman 2.1.14 running on RHEL 5.9 with sendmail 8.13.8-8.1.el5_7, and Python 2.4.3.
The problem is that Mailman stops sending out mail posted to the lists. The mail queues up, and when Mailman is stopped and started, it all gets delivered. That leads to the other strange part: all of the Mailman daemons stop when I run the stop script, except
mailman 15854 1 0 10:01 ? 00:00:00 /usr/bin/python /usr/local/mailman/bin/mailmanctl -s start
mailman 15861 15854 0 10:01 ? 00:00:06 /usr/bin/python /usr/local/mailman/bin/qrunner --runner=OutgoingRunner:0:1 -s
I have to kill the OutgoingRunner specifically. The only thing I see in the logs is a lack of logging. It has been running with stunning reliability on this machine for the last few years, so I am not sure what is going on. Perhaps one of Red Hat's patches broke it.
James Millsap
The University of Chicago Booth School of Business
5807 South Woodlawn Avenue
Chicago, IL 60637
(773) 702-7955
On 4/10/2013 8:43 AM, Millsap, James wrote:
mailman 15854 1 0 10:01 ? 00:00:00 /usr/bin/python /usr/local/mailman/bin/mailmanctl -s start
mailman 15861 15854 0 10:01 ? 00:00:06 /usr/bin/python /usr/local/mailman/bin/qrunner --runner=OutgoingRunner:0:1 -s
I have to kill the OutgoingRunner specifically. The only thing I see in the logs is a lack of logging. It has been running with stunning reliability on this machine for the last few years, so I am not sure what is going on. Perhaps one of Red Hat's patches broke it.
Can you kill -TERM it or do you need to kill -KILL it?
Are you sure there's nothing relevant in Mailman's qrunner log (/var/log/mailman/qrunner if a RHEL packaged Mailman)? Is there a current .bak file in the out queue (/var/spool/mailman/out/)?
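The .bak check above could be scripted roughly as follows. This is only a sketch: `find_bak_files` is a hypothetical helper name, and the queue directory must be passed in, since /var/spool/mailman/out/ applies to packaged installs and a source build will keep its queues elsewhere. A .bak file that is much older than the last successful delivery would suggest a message that was picked up but never finished.

```python
import os
import time


def find_bak_files(queue_dir, now=None):
    """Return (name, age_in_seconds) for each .bak file in queue_dir, oldest first.

    queue_dir is whatever directory holds the outgoing queue on your
    install; nothing here is specific to Mailman beyond the .bak suffix.
    """
    if now is None:
        now = time.time()
    found = []
    for name in os.listdir(queue_dir):
        if name.endswith(".bak"):
            mtime = os.path.getmtime(os.path.join(queue_dir, name))
            found.append((name, now - mtime))
    # Oldest entries first: those are the ones most likely to be stuck.
    found.sort(key=lambda item: item[1], reverse=True)
    return found
```

Calling `find_bak_files("/var/spool/mailman/out")` (or the equivalent path for a source install) and printing the result shows at a glance whether anything has been sitting in the queue since before the hang.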
What does 'lsof' show for the process? You might be able to get something useful from 'gdb' or maybe see something like http://stackoverflow.com/questions/132058/showing-the-stack-trace-from-a-run....
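One way to get a stack trace from a running Python process, in the spirit of the link above, is a signal handler that writes out the current stack on demand. The sketch below is an illustration, not something Mailman ships: `install_stack_dump` is a hypothetical name, the choice of SIGUSR1 is an assumption, and the code would have to be added to the runner before the hang occurs. Python only runs such handlers in the main thread between bytecode instructions, so a process stuck inside a C call that never returns may not respond.

```python
import os
import signal
import sys
import traceback


def install_stack_dump(signum=signal.SIGUSR1, stream=None):
    """Register a handler that writes the interrupted Python stack to stream."""

    def dump_stack(received_signum, frame):
        out = stream if stream is not None else sys.stderr
        out.write("Stack at signal %d (pid %d):\n" % (received_signum, os.getpid()))
        # 'frame' is the frame that was executing when the signal arrived.
        out.write("".join(traceback.format_stack(frame)))
        out.flush()

    signal.signal(signum, dump_stack)
```

With the handler installed, `kill -USR1 <pid>` from a shell would make the process report where it currently is, without stopping it.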
If I had to guess, I'd guess it gets hung waiting for an SMTP response from the MTA.
--
Mark Sapiro mark@msapiro.net
San Francisco Bay Area, California
The highway is for gamblers, better use your sense - B. Dylan
Unfortunately, that is difficult: this machine is critical to our operations, so I don't have much time to troubleshoot before I must have it up and running again. It usually takes around two days for this issue to come up. -TERM will kill it; there is no need to use -KILL. This is built from source, so no Red Hat packages. This is what I have in the qrunner log:
Apr 10 10:01:08 2013 (17606) ArchRunner qrunner caught SIGTERM. Stopping.
Apr 10 10:01:08 2013 (17606) ArchRunner qrunner exiting.
Apr 10 10:01:08 2013 (17611) OutgoingRunner qrunner caught SIGTERM. Stopping.
Apr 10 10:01:08 2013 (17612) VirginRunner qrunner caught SIGTERM. Stopping.
Apr 10 10:01:08 2013 (17612) VirginRunner qrunner exiting.
Apr 10 10:01:08 2013 (17607) BounceRunner qrunner caught SIGTERM. Stopping.
Apr 10 10:01:08 2013 (17608) CommandRunner qrunner caught SIGTERM. Stopping.
Apr 10 10:01:08 2013 (17608) CommandRunner qrunner exiting.
Apr 10 10:01:08 2013 (17609) IncomingRunner qrunner caught SIGTERM. Stopping.
Apr 10 10:01:08 2013 (17609) IncomingRunner qrunner exiting.
Apr 10 10:01:08 2013 (17610) NewsRunner qrunner caught SIGTERM. Stopping.
Apr 10 10:01:08 2013 (17610) NewsRunner qrunner exiting.
Apr 10 10:01:08 2013 (17613) RetryRunner qrunner caught SIGTERM. Stopping.
Apr 10 10:01:08 2013 (17613) RetryRunner qrunner exiting.
Apr 10 10:01:08 2013 (17604) Master watcher caught SIGTERM. Exiting.
Apr 10 10:01:08 2013 (17604) Master qrunner detected subprocess exit (pid: 17606, sig: None, sts: 15, class: ArchRunner, slice: 1/1)
Apr 10 10:01:08 2013 (17604) Master qrunner detected subprocess exit (pid: 17608, sig: None, sts: 15, class: CommandRunner, slice: 1/1)
Apr 10 10:01:08 2013 (17604) Master qrunner detected subprocess exit (pid: 17609, sig: None, sts: 15, class: IncomingRunner, slice: 1/1)
Apr 10 10:01:08 2013 (17604) Master qrunner detected subprocess exit (pid: 17610, sig: None, sts: 15, class: NewsRunner, slice: 1/1)
Apr 10 10:01:08 2013 (17604) Master qrunner detected subprocess exit (pid: 17612, sig: None, sts: 15, class: VirginRunner, slice: 1/1)
Apr 10 10:01:08 2013 (17604) Master qrunner detected subprocess exit (pid: 17613, sig: None, sts: 15, class: RetryRunner, slice: 1/1)
Apr 10 10:01:08 2013 (17607) BounceRunner qrunner exiting.
Apr 10 10:01:08 2013 (17604) Master qrunner detected subprocess exit (pid: 17607, sig: None, sts: 15, class: BounceRunner, slice: 1/1)
Apr 10 10:01:37 2013 (17611) OutgoingRunner qrunner caught SIGTERM. Stopping.
Apr 10 10:01:37 2013 (17604) Master watcher caught SIGTERM. Exiting.
Apr 10 10:01:37 2013 (17611) OutgoingRunner qrunner caught SIGTERM. Stopping.
Apr 10 10:01:37 2013 (17611) OutgoingRunner qrunner exiting.
Apr 10 10:01:38 2013 (17604) Master qrunner detected subprocess exit (pid: 17611, sig: None, sts: 15, class: OutgoingRunner, slice: 1/1)
Apr 10 10:01:58 2013 (15858) CommandRunner qrunner started.
Apr 10 10:01:59 2013 (15859) IncomingRunner qrunner started.
Apr 10 10:01:59 2013 (15856) ArchRunner qrunner started.
Apr 10 10:01:59 2013 (15857) BounceRunner qrunner started.
Apr 10 10:01:59 2013 (15862) VirginRunner qrunner started.
Apr 10 10:01:59 2013 (15860) NewsRunner qrunner started.
Apr 10 10:01:59 2013 (15863) RetryRunner qrunner started.
Apr 10 10:01:59 2013 (15861) OutgoingRunner qrunner started.
-----Original Message----- From: Mark Sapiro [mailto:mark@msapiro.net] Sent: Wednesday, April 10, 2013 3:59 PM To: Millsap, James Cc: mailman-users@python.org Subject: Re: [Mailman-Users] mailman 2.1.14 stops sending mail
[...]
On Thu, Apr 11, 2013 at 12:07 PM, Millsap, James <James.Millsap@chicagobooth.edu> wrote:
[...]
Apr 10 10:01:37 2013 (17611) OutgoingRunner qrunner exiting.
Apr 10 10:01:38 2013 (17604) Master qrunner detected subprocess exit (pid: 17611, sig: None, sts: 15, class: OutgoingRunner, slice: 1/1)
.... approx 20 seconds time ....
Apr 10 10:01:58 2013 (15858) CommandRunner qrunner started.
Apr 10 10:01:59 2013 (15859) IncomingRunner qrunner started.
[...]
To me, the above looks like a system reboot. Is something rebooting the box at 10am?
-Jim P.
Nope, it just took me that long to make sure all of the processes were down, and restart them.
-----Original Message----- From: Mailman-Users [mailto:mailman-users-bounces+james.millsap=chicagobooth.edu@python.org] On Behalf Of Jim Popovitch Sent: Thursday, April 11, 2013 1:23 PM To: mailman-users@python.org Subject: Re: [Mailman-Users] mailman 2.1.14 stops sending mail
[...]
On 4/11/2013 9:07 AM, Millsap, James wrote:
Unfortunately, that is difficult: this machine is critical to our operations, so I don't have much time to troubleshoot before I must have it up and running again. It usually takes around two days for this issue to come up. -TERM will kill it; there is no need to use -KILL. This is built from source, so no Red Hat packages. This is what I have in the qrunner log:
[...]
Apr 10 10:01:08 2013 (17611) OutgoingRunner qrunner caught SIGTERM. Stopping.
[...]
Apr 10 10:01:08 2013 (17604) Master watcher caught SIGTERM. Exiting.
[...]
Apr 10 10:01:37 2013 (17604) Master watcher caught SIGTERM. Exiting.
Apr 10 10:01:37 2013 (17611) OutgoingRunner qrunner caught SIGTERM. Stopping.
Apr 10 10:01:37 2013 (17611) OutgoingRunner qrunner exiting.
Apr 10 10:01:38 2013 (17604) Master qrunner detected subprocess exit (pid: 17611, sig: None, sts: 15, class: OutgoingRunner, slice: 1/1)
[...]
Interesting that OutgoingRunner wouldn't exit until it was SIGTERMed a second time. It seems highly likely that it is waiting on something 'not interruptible', and that this is why it stops processing in the first place and is reluctant to die.
The real question is what it is waiting on, and why. Without the answer, or at least some further clue, I can't say.
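To make the delay visible, the qrunner log can be scanned for how long each runner took between its first "caught SIGTERM" entry and its "exiting" entry. The following is a minimal sketch; the function name and the regular expression are assumptions based only on the log lines quoted in this thread, and the sample is a few of those lines.

```python
import re
from datetime import datetime

# Matches e.g. "Apr 10 10:01:08 2013 (17611) OutgoingRunner qrunner exiting."
LINE_RE = re.compile(
    r"^(\w{3} +\d+ \d\d:\d\d:\d\d \d{4}) \((\d+)\) (\w+) qrunner "
    r"(caught SIGTERM|exiting)"
)


def sigterm_to_exit_delays(lines):
    """Map (runner, pid) to seconds between its first SIGTERM and its exit."""
    first_sigterm = {}
    delays = {}
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        stamp = datetime.strptime(match.group(1), "%b %d %H:%M:%S %Y")
        key = (match.group(3), int(match.group(2)))
        if match.group(4) == "caught SIGTERM":
            first_sigterm.setdefault(key, stamp)  # keep only the first one
        elif key in first_sigterm:
            delays[key] = (stamp - first_sigterm[key]).total_seconds()
    return delays


SAMPLE = """\
Apr 10 10:01:08 2013 (17606) ArchRunner qrunner caught SIGTERM. Stopping.
Apr 10 10:01:08 2013 (17606) ArchRunner qrunner exiting.
Apr 10 10:01:08 2013 (17611) OutgoingRunner qrunner caught SIGTERM. Stopping.
Apr 10 10:01:37 2013 (17611) OutgoingRunner qrunner caught SIGTERM. Stopping.
Apr 10 10:01:37 2013 (17611) OutgoingRunner qrunner exiting.
""".splitlines()

for (runner, pid), seconds in sorted(sigterm_to_exit_delays(SAMPLE).items()):
    print("%s (%d): %.0f s" % (runner, pid, seconds))
```

On these lines ArchRunner exits within the same second, while OutgoingRunner (pid 17611) takes 29 seconds and only exits after the second SIGTERM.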
Check the MTA logs from the time OutgoingRunner 'hung' and the time it was SIGTERMed. Also consider enabling smtplib debug logging (see http://wiki.list.org/x/-IA9).
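Separately from Mailman's own debug logging, a small standalone probe can show whether the MTA answers promptly. The sketch below is not Mailman code: `probe_mta` is a hypothetical helper, the host, port and timeout values are assumptions, and it relies on the `timeout` argument of `smtplib.SMTP`, which exists in current Python versions but not in the Python 2.4.3 mentioned at the top of the thread.

```python
import smtplib
import socket


def probe_mta(host="localhost", port=25, timeout=10.0, debug=False):
    """Attempt one SMTP NOOP round trip; return (ok, detail) rather than hang."""
    # No host is given here, so no connection is made yet. The timeout
    # applies to the later connect and to every read on the socket.
    conn = smtplib.SMTP(timeout=timeout)
    if debug:
        conn.set_debuglevel(1)  # echo the SMTP dialogue to stderr
    try:
        code, _ = conn.connect(host, port)
        if code != 220:
            return False, "unexpected greeting code %d" % code
        code, _ = conn.noop()
        return code == 250, "NOOP reply code %d" % code
    except (smtplib.SMTPException, socket.error) as exc:
        # Covers refused connections, timeouts and dropped sessions.
        return False, "%s: %s" % (type(exc).__name__, exc)
    finally:
        conn.close()
```

Running `probe_mta(debug=True)` from cron while the problem is developing, and comparing its output with the MTA logs for the same minute, may help narrow down whether the MTA stops responding before OutgoingRunner does.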
--
Mark Sapiro mark@msapiro.net
San Francisco Bay Area, California
The highway is for gamblers, better use your sense - B. Dylan