[Mailman-Users] Daily Restarts Needed
Mark Sapiro
mark at msapiro.net
Mon Jan 11 20:29:26 EST 2016
On 01/11/2016 10:47 AM, Greg Sims wrote:
> We are having problems with mailman needing to be restarted to complete
> sending email to our lists. The problem has been getting worse over time
> and is now almost a daily requirement. I spent some time debugging this
> morning and found some issues.
>
> I used top to look at the number of active processes that are owned by user
> mailman. I found 35 processes including 14 command "mailmanctl". This
> looks like it might be the source (at least one of them) to our problems.
> Mailman is currently delivering email to our users after a restart.
This is a big problem. See the FAQ at <http://wiki.list.org/x/4030715>
for instructions on stopping Mailman completely and then starting it
only once.
Note that several packages, and even our source, provide an init.d
script that runs mailmanctl with the -s option; however, versions prior
to 2.1.16 could start additional instances when this option was used.
See <https://bugs.launchpad.net/mailman/+bug/1189558>.
> I would like to clear this up without rebooting our server. Please give me
> some input on a plan to clean this up. Here is a start:
>
> - wait for the need for mailman to be low -- we send email three times
> per day to newsletters & and have no forums/blogs
> - service mailman stop
> - delete all processes running under mailman
> - ?? should I clear lock files or other actions ??
> - service mailman start
> - verify the proper number of process are running -- ?? there should be
> eight processes ??
Your plan looks good. Once Mailman and all its qrunners are stopped,
there should be nothing in Mailman's locks/ directory. If there are any
files there at that point, remove them.
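The lock check can be scripted. What follows is only a minimal sketch,
assuming Mailman lives under /usr/local/mailman (the prefix in the ps
listing below; packaged installs often use a different path). The helper
name leftover_locks is made up for illustration. It only lists entries
and deletes nothing, so you can confirm that every Mailman process is
gone before removing anything.

```python
import os

# Assumed location, based on the /usr/local/mailman prefix in the ps
# listing; adjust for your installation.
LOCKS_DIR = "/usr/local/mailman/locks"


def leftover_locks(locks_dir=LOCKS_DIR):
    """Return the sorted names of all entries in the locks directory."""
    if not os.path.isdir(locks_dir):
        return []
    return sorted(os.listdir(locks_dir))


if __name__ == "__main__":
    # Run this only after Mailman and every qrunner have been stopped.
    for name in leftover_locks():
        print("leftover lock file:", os.path.join(LOCKS_DIR, name))
```

Anything it prints after a full stop is a candidate for removal.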
There should be nine processes, eight qrunners and mailmanctl as in
> mark at sbh16:~$ ps -fwwU mailman
> UID PID PPID C STIME TTY TIME CMD
> mailman 20219 1 0 Jan09 ? 00:00:00 /usr/bin/python /usr/local/mailman/bin/mailmanctl -s -q start
> mailman 20221 20219 0 Jan09 ? 00:00:14 /usr/bin/python /usr/local/mailman/bin/qrunner --runner=ArchRunner:0:1 -s
> mailman 20222 20219 0 Jan09 ? 00:00:14 /usr/bin/python /usr/local/mailman/bin/qrunner --runner=BounceRunner:0:1 -s
> mailman 20223 20219 0 Jan09 ? 00:00:13 /usr/bin/python /usr/local/mailman/bin/qrunner --runner=CommandRunner:0:1 -s
> mailman 20224 20219 0 Jan09 ? 00:00:13 /usr/bin/python /usr/local/mailman/bin/qrunner --runner=IncomingRunner:0:1 -s
> mailman 20225 20219 0 Jan09 ? 00:00:13 /usr/bin/python /usr/local/mailman/bin/qrunner --runner=NewsRunner:0:1 -s
> mailman 20226 20219 0 Jan09 ? 00:00:22 /usr/bin/python /usr/local/mailman/bin/qrunner --runner=OutgoingRunner:0:1 -s
> mailman 20227 20219 0 Jan09 ? 00:00:18 /usr/bin/python /usr/local/mailman/bin/qrunner --runner=VirginRunner:0:1 -s
> mailman 20228 20219 0 Jan09 ? 00:00:00 /usr/bin/python /usr/local/mailman/bin/qrunner --runner=RetryRunner:0:1 -s
> mark at sbh16:~$
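To compare your own system against that listing, something like the
following could tally the processes. This is a hedged sketch, not part
of Mailman: the function names are invented, and it assumes the default
of one slice per runner (the :0:1 suffix above). If you have configured
multiple slices for a runner, more than one process for that runner is
expected.

```python
import re
import subprocess
from collections import Counter


def summarize_ps(ps_text):
    """Count mailmanctl masters and qrunners (by runner name) in ps output."""
    counts = Counter()
    for line in ps_text.splitlines():
        if "bin/mailmanctl" in line:
            counts["mailmanctl"] += 1
            continue
        match = re.search(r"--runner=(\w+):", line)
        if match:
            counts[match.group(1)] += 1
    return counts


def duplicates(counts):
    """Return the sorted names that appear more than once."""
    return sorted(name for name, n in counts.items() if n > 1)


if __name__ == "__main__":
    # Same ps invocation as in the listing above.
    result = subprocess.run(["ps", "-fwwU", "mailman"],
                            capture_output=True, text=True)
    counts = summarize_ps(result.stdout)
    print("total Mailman processes:", sum(counts.values()))
    print("duplicated:", duplicates(counts))
```

On a healthy default install the total should be nine and the list of
duplicates should be empty.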
> I would also like input on additional debug techniques to find the root
> cause of this issue.
Start by looking at Mailman's error log. Also look at Mailman's qrunner,
smtp and smtp-failure logs.
If the problem recurs it's almost certainly OutgoingRunner. At that
point, look at Mailman's qfiles/out/ directory.
If OutgoingRunner is 'stuck' there will be one file in that queue with a
.bak extension (at least in MM versions since 2.1.9). Other files, if
any, will have a .pck extension. The file with the .bak extension is the
one being processed. You can look at it with 'bin/dumpdb -p', but it
will probably look OK.
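A quick way to see the state of a queue is to split its entries by
extension. The sketch below assumes the queues live under
/usr/local/mailman/qfiles (again going by the prefix in the ps listing);
queue_state is a hypothetical helper, not a Mailman script.

```python
import os

# Assumed location; adjust for your installation.
QFILES_DIR = "/usr/local/mailman/qfiles"


def queue_state(queue_dir):
    """Return (bak, pck): sorted .bak and .pck file names in queue_dir."""
    bak, pck = [], []
    for name in sorted(os.listdir(queue_dir)):
        if name.endswith(".bak"):
            bak.append(name)
        elif name.endswith(".pck"):
            pck.append(name)
    return bak, pck


if __name__ == "__main__":
    bak, pck = queue_state(os.path.join(QFILES_DIR, "out"))
    print("in progress (.bak):", bak)
    print("waiting (.pck):", len(pck))
```

A .bak file that stays the same across several runs while the .pck count
keeps growing is consistent with a stuck runner.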
If OutgoingRunner seems stuck, sending it a SIGHUP may unstick it
(SIGHUP normally says reopen your log files). Also look at locks (see
<http://wiki.list.org/x/17891756> for info on lock files).
If it is stuck, there are ways to get a stack trace from it, but I think
they normally require some modification to the code before it starts.
There is another possibility that the hung process is IncomingRunner. In
this case, the out/ queue would be empty and the in/ queue would have
the one .bak file. The rest of the above applies.
My guess though is OutgoingRunner is stuck. You stop Mailman and it
mostly stops, but not OutgoingRunner. You then start it again and this
starts a new OutgoingRunner which sees the old one 'crashed' and left
the .bak behind and starts by processing that. Do this a few times and
you have a number of stuck OutgoingRunner processes hanging around.
--
Mark Sapiro <mark at msapiro.net> The highway is for gamblers,
San Francisco Bay Area, California better use your sense - B. Dylan