
We are having problems with mailman needing to be restarted to complete sending email to our lists. The problem has been getting worse over time and is now almost a daily requirement. I spent some time debugging this morning and found some issues.
I used top to look at the number of active processes owned by user mailman. I found 35 processes, including 14 running the command "mailmanctl". This looks like it might be the source (or at least one of the sources) of our problems. Mailman is currently delivering email to our users after a restart.
I would like to clear this up without rebooting our server. Please give me some input on a plan to clean this up. Here is a start:
- wait for the need for mailman to be low -- we send email three times per day to newsletters and have no forums/blogs
- service mailman stop
- delete all processes running under mailman
- ?? should I clear lock files or other actions ??
- service mailman start
- verify the proper number of processes are running -- ?? there should be eight processes ??
I would also like input on additional debug techniques to find the root cause of this issue.
Thanks, Greg

On 01/11/2016 10:47 AM, Greg Sims wrote:
This is a big problem. See the FAQ at <http://wiki.list.org/x/4030715> for instructions on stopping Mailman completely and then starting it only once.
Note that several packages and even our source provide an init.d script that runs mailmanctl with the -s option; however, versions prior to 2.1.16 could start additional instances when this option was used. See <https://bugs.launchpad.net/mailman/+bug/1189558>.
Your plan looks good. Once Mailman and all its qrunners are stopped, there should be nothing in Mailman's locks/ directory. If there are any files there at that point, remove them.
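The lock cleanup step could be sketched roughly as follows. This is a minimal Python 3 sketch and not part of Mailman: the function names and the locks_dir argument are hypothetical, and the location of locks/ depends on how Mailman was installed. It deletes files only when the caller states that mailmanctl and all qrunners have already exited.

```python
import os


def leftover_locks(locks_dir):
    """Return the sorted names of regular files left in the locks/ directory."""
    return sorted(
        name
        for name in os.listdir(locks_dir)
        if os.path.isfile(os.path.join(locks_dir, name))
    )


def remove_leftover_locks(locks_dir, mailman_stopped):
    """Delete leftover lock files and return the names that were removed.

    mailman_stopped must be True, meaning the caller has already confirmed
    that mailmanctl and every qrunner have exited (hypothetical safety flag).
    """
    if not mailman_stopped:
        raise RuntimeError("refusing to remove locks while Mailman may be running")
    removed = []
    for name in leftover_locks(locks_dir):
        os.remove(os.path.join(locks_dir, name))
        removed.append(name)
    return removed
```

Calling leftover_locks() first and reading the list before removing anything is the cautious way to use this; an empty list means there is nothing to clean up.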
There should be nine processes, eight qrunners and mailmanctl as in
> I would also like input on additional debug techniques to find the root cause of this issue.
Start by looking at Mailman's error log. Also look at Mailman's qrunner, smtp and smtp-failure logs.
If the problem recurs it's almost certainly OutgoingRunner. At that point, look at Mailman's qfiles/out/ directory.
If OutgoingRunner is 'stuck', there will be one file in that queue with a .bak extension (at least in MM versions since 2.1.9). Other files, if any, will have a .pck extension. The file with the .bak extension is the one being processed. You can look at it with 'bin/dumpdb -p', but it would probably look OK.
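The queue check described above could be sketched like this. It is a minimal Python 3 sketch under stated assumptions: queue_snapshot and its queue_dir argument are hypothetical names, and the full path to qfiles/out/ depends on the installation. It only reads the directory listing and does not touch the files.

```python
import os


def queue_snapshot(queue_dir):
    """Split a Mailman queue directory listing into two sorted lists:
    entries being processed (.bak) and entries still waiting (.pck)."""
    in_progress, waiting = [], []
    for name in sorted(os.listdir(queue_dir)):
        if name.endswith(".bak"):
            in_progress.append(name)
        elif name.endswith(".pck"):
            waiting.append(name)
    return in_progress, waiting
```

Taking two snapshots some minutes apart and seeing the same .bak name in both, while the .pck list grows, would be consistent with a stuck runner. The same function can be pointed at the in/ queue when IncomingRunner is the suspect.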
If OutgoingRunner seems stuck, sending it a SIGHUP may unstick it (SIGHUP normally says reopen your log files). Also look at locks (see <http://wiki.list.org/x/17891756> for info on lock files).
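Sending the signal is usually just 'kill -HUP <pid>' from a shell; the equivalent in Python 3 is sketched below. The function name send_sighup is hypothetical, and the PID of the stuck runner has to be found separately, for example from ps output.

```python
import os
import signal


def send_sighup(pid):
    """Send SIGHUP to one process.

    Returns True if the signal was sent and False if no process with that
    PID exists. A PermissionError is left to propagate, since it means the
    caller is not allowed to signal that process.
    """
    try:
        os.kill(pid, signal.SIGHUP)
    except ProcessLookupError:
        return False
    return True
```

This needs to run as root or as the user that owns the Mailman processes.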
If it is stuck, there are ways to get a stack trace from it, but I think they normally require some modification to the code before it starts.
There is another possibility that the hung process is IncomingRunner. In this case, the out/ queue would be empty and the in/ queue would have the one .bak file. The rest of the above applies.
My guess though is OutgoingRunner is stuck. You stop Mailman and it mostly stops, but not OutgoingRunner. You then start it again and this starts a new OutgoingRunner which sees the old one 'crashed' and left the .bak behind and starts by processing that. Do this a few times and you have a number of stuck OutgoingRunner processes hanging around.
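One way to spot leftover runners of this kind is to count processes per runner from ps output. The following is a minimal Python 3 sketch, not a Mailman tool: the function names are hypothetical, and the regular expression assumes that qrunner command lines contain '--runner=Name:slice:count', which should be checked against the actual ps output on the server.

```python
import re
from collections import Counter

# Assumed shape of a qrunner command line, for example:
#   /usr/bin/python /usr/lib/mailman/bin/qrunner --runner=OutgoingRunner:0:1 -s
RUNNER_RE = re.compile(r"--runner=(\w+:\d+:\d+)")


def count_runners(ps_lines):
    """Count processes per runner spec, plus mailmanctl, from ps command lines."""
    counts = Counter()
    for line in ps_lines:
        match = RUNNER_RE.search(line)
        if match:
            counts[match.group(1)] += 1
        elif "mailmanctl" in line:
            counts["mailmanctl"] += 1
    return counts


def duplicated(counts):
    """Return the sorted names that appear more than once."""
    return sorted(name for name, n in counts.items() if n > 1)
```

The input could be the lines printed by something like 'ps -u mailman -o args=' (the user name mailman is taken from the original report). Any name returned by duplicated() points at processes left over from an earlier start.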
--
Mark Sapiro <mark@msapiro.net>
San Francisco Bay Area, California
The highway is for gamblers, better use your sense - B. Dylan
