Mailman stuck : mailmanctl dead with messages in /qfiles/in
Hi.
Mailman sometimes gets stuck on my installation. By stuck, I mean I get no mail and no subscription notifications until I relaunch it.
It generally does not happen for a few months, but this is the second time this month.
Troubleshooting
http://wiki.list.org/display/DOC/4.78+Troubleshooting-+No+mail+going+out+to+...
2/ Cron/mailmanctl
ps auxww| grep mailmanctl |grep -v grep -> Nothing.
7/ Locks
/var/lib/mailman/locks -> /var/lock/mailman
ll /var/lock/mailman
total 0
8/ Logs
/var/log/mailman/error :
Apr 30 03:16:21 2012 mailmanctl(11685): No child with pid: 17093
Apr 30 03:16:21 2012 mailmanctl(11685): [Errno 3] No such process
Apr 30 03:16:21 2012 mailmanctl(11685): Stale pid file removed.
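The "Stale pid file removed" entry suggests that mailmanctl found a pid file naming a process that no longer exists. A minimal Python sketch of that kind of check follows. The helper names are hypothetical and this is not Mailman's own code; the pid file path depends on the installation and is left to the caller.

```python
import errno
import os


def pid_is_alive(pid):
    """Return True if a process with this pid exists."""
    if pid <= 0:
        return False
    try:
        # Signal 0 delivers nothing; it only checks that the pid can be signalled.
        os.kill(pid, 0)
    except OSError as exc:
        # ESRCH is "[Errno 3] No such process", as seen in the error log.
        if exc.errno == errno.ESRCH:
            return False
        # EPERM means the process exists but belongs to another user.
        if exc.errno == errno.EPERM:
            return True
        raise
    return True


def pid_file_is_stale(path):
    """Return True if the pid file exists but names a dead process."""
    try:
        with open(path) as handle:
            pid = int(handle.read().strip())
    except (IOError, OSError, ValueError):
        return False  # no readable pid file, so nothing to call stale
    return not pid_is_alive(pid)
```

Calling `pid_file_is_stale()` on the installation's mailmanctl pid file would tell whether the master is really gone before restarting it.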
9/ Qfiles
All queues are empty, except for "virgin" and "in".
/var/lib/mailman/qfiles/in:
total 168
-rw-rw---- 1 nobody list  7080 Apr 29 08:12 1335679972.4044311+fe0d8677ff8b10bdf54bbe785264c433c40927a2.pck
-rw-rw---- 1 nobody list 19014 Apr 29 09:41 1335685311.8046031+26be5e0c4630e8401808bdf46c701a1975950cec.pck
-rw-rw---- 1 nobody list 13435 Apr 29 10:20 1335687645.366998+ceda98fd4ccfb8a3cc6e8bb9d41e943e0ef8ec3f.pck
-rw-rw---- 1 nobody list 14010 Apr 29 12:17 1335694670.941314+ab4263a19afa990cc933e18ccf1ab3a86ef02b84.pck
-rw-rw---- 1 nobody list 26875 Apr 29 14:00 1335700832.115989+38cf9b0c8d1015392d44ec8047287b9c2da44260.pck
-rw-rw---- 1 nobody list 25951 Apr 29 14:02 1335700966.2184939+cf9d490300e3064c49a587a5efc0d2678e8c0b0a.pck
-rw-rw---- 1 nobody list 33409 Apr 29 21:13 1335726794.7660511+a3e33986aa07f3c56214159c2644714ebecbb1ff.pck
-rw-rw---- 1 nobody list  1660 Apr 29 23:25 1335734733.107975+629b7c7b1e0fab72b5b60556c4a9590971461f3d.pck
-rw-rw---- 1 nobody list  1420 Apr 30 11:43 1335778989.361089+5a2c8db1b2e83fb4c56900018fd315fce62c9aa0.pck
-rw-rw---- 1 nobody list  2341 Apr 30 15:20 1335792010.1874161+1e1d6b7aff6b91871fc0253416f32f040bb3b8b3.pck
-rw-rw---- 1 nobody list  3225 Apr 30 20:25 1335810323.462666+60026e952d7f0bfe526ae95fe4cb2562051c93c0.pck
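The queue file names above look like a Unix timestamp, a "+", and a hex digest. Assuming that reading is right (it is inferred from the listing, not confirmed in this thread), a small Python sketch can turn a file name into the time the message was queued, which helps to see when the backlog started. The function name is hypothetical.

```python
from datetime import datetime, timezone


def parse_queue_filename(name):
    """Split 'TIMESTAMP+HEXDIGEST.pck' into (UTC datetime, digest)."""
    base, ext = name.rsplit(".", 1)
    if ext != "pck":
        raise ValueError("not a .pck queue entry: %r" % name)
    stamp, digest = base.split("+", 1)
    return datetime.fromtimestamp(float(stamp), tz=timezone.utc), digest


# Oldest entry from the listing above.
oldest = "1335679972.4044311+fe0d8677ff8b10bdf54bbe785264c433c40927a2.pck"
queued_at, digest = parse_queue_filename(oldest)
print(queued_at.strftime("%Y-%m-%d %H:%M UTC"))  # 2012-04-29 06:12 UTC
```

That would agree with the 08:12 modification time in the listing if the server clock runs two hours ahead of UTC, and it places the oldest stuck message a few hours after the 03:16 entries in the logs.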
Debug info
Last time it got stuck, I did the modification suggested here :
http://wiki.list.org/display/DOC/4.73+How+do+I+debug+smtp-failure+problems+-...
modified /var/lib/mailman/Mailman/Handlers/SMTPDirect.py to add self.__conn.set_debuglevel(1)
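For reference, here is a minimal standalone Python sketch of what that call does, using the standard smtplib module directly rather than Mailman's SMTPDirect.py. With a debug level of 1, smtplib prints the SMTP conversation to stderr; whether that output ends up in Mailman's error log depends on how the qrunner's stderr is redirected, which is an assumption here.

```python
import smtplib

# Constructing SMTP() without a host does not open a connection.
conn = smtplib.SMTP()

# Same call as the one added to SMTPDirect.py: once a connection is made,
# every command and server reply is echoed to stderr.
conn.set_debuglevel(1)

print(conn.debuglevel)
```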
Configuration
Not sure this is useful, but /etc/mailman/mm_cfg.py contains
MTA='LocalPostfix'
POSTFIX_STYLE_VIRTUAL_DOMAINS = ['domain1.tld', 'domain2.tld']
I would like to figure out what is happening, so I have not relaunched it yet, in case there is more information I could gather.
Any advice?
Thanks.
-- Jérôme
On 4/30/2012 3:43 PM, Jérôme wrote:
Mailman sometimes gets stuck on my installation. By stuck, I mean I get no mail and no subscription notification. Until I relaunch it.
[...]
How about
ps auxww| grep qrunner |grep -v grep
It appears that some process or person is stopping Mailman.
How about /var/log/mailman/qrunner ?
And yet you are not logging any smtp debugging in Mailman's error log. There should be copious log information for every outgoing message.
The above line should cause significant problems when attempting to create or remove lists. It MUST be one of
MTA = 'Postfix'
MTA = 'Manual'
MTA = None
'Postfix' means generate aliases and virtual-mailman files for Postfix.
'Manual' means display the necessary aliases.
None means don't do anything with aliases when lists are created/removed.
What's in the qrunner log?
--
Mark Sapiro <mark@msapiro.net>        The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan
Hi.
Thanks for answering.
On Mon, 30 Apr 2012 16:15:03 -0700, Mark Sapiro wrote:
Nothing either.
OK. Need to figure out which.
Each day, I have something like this :
Apr 28 03:16:33 2012 (17099) OutgoingRunner qrunner caught SIGHUP. Reopening logs.
Apr 28 03:16:33 2012 (17094) ArchRunner qrunner caught SIGHUP. Reopening logs.
Apr 28 03:16:33 2012 (17097) IncomingRunner qrunner caught SIGHUP. Reopening logs.
Apr 28 03:16:33 2012 (17093) Master watcher caught SIGHUP. Re-opening log files.
Apr 28 03:16:34 2012 (17095) BounceRunner qrunner caught SIGHUP. Reopening logs.
Apr 28 03:16:34 2012 (17101) RetryRunner qrunner caught SIGHUP. Reopening logs.
Apr 28 03:16:34 2012 (17096) CommandRunner qrunner caught SIGHUP. Reopening logs.
Apr 28 03:16:34 2012 (17098) NewsRunner qrunner caught SIGHUP. Reopening logs.
Apr 28 03:16:34 2012 (17100) VirginRunner qrunner caught SIGHUP. Reopening logs.
The day it stopped, I got this :
Apr 29 03:16:29 2012 (17099) OutgoingRunner qrunner caught SIGHUP. Reopening logs.
Apr 29 03:16:29 2012 (17094) ArchRunner qrunner caught SIGHUP. Reopening logs.
Apr 29 03:16:29 2012 (17097) IncomingRunner qrunner caught SIGHUP. Reopening logs.
Apr 29 03:16:29 2012 (17093) Master watcher caught SIGHUP. Re-opening log files.
Apr 29 03:16:29 2012 (17097) IncomingRunner qrunner caught SIGTERM. Stopping.
Apr 29 03:16:29 2012 (17099) OutgoingRunner qrunner caught SIGTERM. Stopping.
Apr 29 03:16:29 2012 (17097) IncomingRunner qrunner exiting.
Apr 29 03:16:29 2012 (17094) ArchRunner qrunner caught SIGTERM. Stopping.
Apr 29 03:16:29 2012 (17099) OutgoingRunner qrunner exiting.
Apr 29 03:16:29 2012 (17094) ArchRunner qrunner exiting.
Apr 29 03:16:29 2012 (17096) CommandRunner qrunner caught SIGHUP. Reopening logs.
Apr 29 03:16:29 2012 (17101) RetryRunner qrunner caught SIGHUP. Reopening logs.
Apr 29 03:16:29 2012 (17095) BounceRunner qrunner caught SIGHUP. Reopening logs.
Apr 29 03:16:29 2012 (17098) NewsRunner qrunner caught SIGHUP. Reopening logs.
Apr 29 03:16:29 2012 (17098) NewsRunner qrunner caught SIGTERM. Stopping.
Apr 29 03:16:29 2012 (17095) BounceRunner qrunner caught SIGTERM. Stopping.
Apr 29 03:16:29 2012 (17096) CommandRunner qrunner caught SIGTERM. Stopping.
Apr 29 03:16:29 2012 (17101) RetryRunner qrunner caught SIGTERM. Stopping.
Apr 29 03:16:29 2012 (17100) VirginRunner qrunner caught SIGHUP. Reopening logs.
Apr 29 03:16:29 2012 (17096) CommandRunner qrunner exiting.
Apr 29 03:16:29 2012 (17098) NewsRunner qrunner exiting.
Apr 29 03:16:29 2012 (17095) BounceRunner qrunner exiting.
Apr 29 03:16:29 2012 (17100) VirginRunner qrunner caught SIGTERM. Stopping.
Apr 29 03:16:29 2012 (17101) RetryRunner qrunner exiting.
Apr 29 03:16:29 2012 (17100) VirginRunner qrunner exiting.
Sorry for the mess, here. But I think you get the idea.
Seems to happen during a cron job.
Bug reports that could be related :
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=505638
https://bugs.launchpad.net/mailman/+bug/265855
There was. But it stopped. The last message for which I have a lot of debug info is from Apr 22, one week before Mailman stopped sending messages.
-rw-rw-r-- 1 list list      198 Apr 30 03:16 /var/log/mailman/error
-rw-rw-r-- 1 list list        0 Apr 22 03:16 /var/log/mailman/error.1
-rw-rw-r-- 1 list list        0 Apr 15 03:16 /var/log/mailman/error.2
-rw-rw-r-- 1 list list 36541617 Apr 22 01:59 /var/log/mailman/error.3
Should there be anything relevant in there ?
I configured mailman 3 years ago. I don't remember everything but it comes from here : http://isp-control.net/documentation/howto/mail/setup_mailman
Is it such a bad idea ?
I suppose it is unrelated, anyway.
The good thing is that there is a relatively recent bug open on Debian that might be closed if we manage to find the root cause and solve this.
I just did a little bit of cleanup tonight, after I realized the server was almost full. At least the partition that hosts mailman queues and logs. Would we see something specific in case of lack of space ?
Thank you for your help.
-- Jérôme
On 4/30/2012 5:19 PM, Jérôme wrote:
That is presumably a logrotate process.
So here in addition to the normal SIGHUPs, presumably from logrotate, you also have SIGTERMs possibly from a "bin/mailmanctl stop" although you don't show a "Master watcher caught SIGTERM" entry.
Check your system cron log to see what was running at the time.
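The comparison made above (which processes logged SIGTERM, and whether the master watcher did) can be automated. The following is a rough Python sketch that assumes the qrunner log keeps the line format quoted in this thread; the function and variable names are hypothetical.

```python
import re
from collections import defaultdict

# Matches entries such as:
#   Apr 29 03:16:29 2012 (17097) IncomingRunner qrunner caught SIGTERM. Stopping.
ENTRY = re.compile(
    r"^(?P<when>\w{3} +\d+ [\d:]{8} \d{4}) \((?P<pid>\d+)\) "
    r"(?P<name>.+?) caught (?P<signal>SIG[A-Z]+)\."
)


def signals_by_process(lines):
    """Map each process name to the set of signals it logged catching."""
    seen = defaultdict(set)
    for line in lines:
        match = ENTRY.match(line.strip())
        if match:
            seen[match.group("name")].add(match.group("signal"))
    return dict(seen)


# A few entries copied from the Apr 29 log quoted in this thread.
sample = [
    "Apr 29 03:16:29 2012 (17097) IncomingRunner qrunner caught SIGHUP. Reopening logs.",
    "Apr 29 03:16:29 2012 (17093) Master watcher caught SIGHUP. Re-opening log files.",
    "Apr 29 03:16:29 2012 (17097) IncomingRunner qrunner caught SIGTERM. Stopping.",
    "Apr 29 03:16:29 2012 (17097) IncomingRunner qrunner exiting.",
]
summary = signals_by_process(sample)
master_terminated = "SIGTERM" in summary.get("Master watcher", set())
```

On these sample lines `master_terminated` is False while the IncomingRunner shows both signals, which is the oddity pointed out above.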
No. That isn't relevant to this issue. Apparently, something reversed the patch on April 22 which is just as well.
The curious thing is you are rotating logs weekly, but "reopening" them daily. Either this is two different processes or your logrotate script is a bit strange.
Which is wrong as it contains lines like
MTA=Postfix
and
MTA=localPostfix
If you put either of those lines literally in mm_cfg.py without quotes around Postfix or localPostfix, Mailman won't run at all because every Mailman process will encounter a fatal error on importing mm_cfg.
Apparently you figured that out as you have quoted it.
MTA='LocalPostfix'
and presumably you have named your edited module LocalPostfix.py rather than localPostfix.py or it would be throwing errors when creating/removing lists.
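To make the quoting point concrete, here is a small Python sketch that evaluates the two spellings the way any Python config file is evaluated. It does not import Mailman; `load_config` is a hypothetical helper, and the module path it builds only illustrates the naming rule described above.

```python
def load_config(source):
    """Execute config source text and return the names it defines."""
    namespace = {}
    exec(source, namespace)
    return namespace


# Unquoted: Python looks for a variable called Postfix and fails.
try:
    load_config("MTA=Postfix")
    unquoted_error = None
except NameError as exc:
    unquoted_error = str(exc)

# Quoted: MTA is a plain string, and it must match the module file name exactly.
quoted = load_config("MTA='LocalPostfix'")
module_path = "Mailman/MTA/%s.py" % quoted["MTA"]
print(module_path)  # Mailman/MTA/LocalPostfix.py
```

Since the lookup is case sensitive, 'LocalPostfix' and 'localPostfix' would point at two different files.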
Is it such a bad idea ?
No, because the howto you followed had you create Mailman/MTA/localPostfix.py (or LocalPostfix.py), but how was I supposed to know that?
See the FAQ at <http://wiki.list.org/x/OIDD>.
I suppose it is unrelated, anyway.
Yes, it is unrelated.
You would begin to see exceptions thrown for inability to create/write files.
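To catch a nearly full partition before such exceptions appear, a periodic check along these lines might help. This is only a sketch: the function names and the 5% threshold are arbitrary assumptions, and the path to check would be wherever the queues and logs live (for example /var/lib/mailman on this system).

```python
import shutil


def free_fraction(path):
    """Return the fraction of the filesystem holding `path` that is free."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return usage.free / usage.total


def is_nearly_full(path, threshold=0.05):
    """Return True if less than `threshold` of the filesystem is free."""
    return free_fraction(path) < threshold
```

For instance, `is_nearly_full("/var/lib/mailman")` could be run from cron and its result mailed to the administrator.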
--
Mark Sapiro <mark@msapiro.net>        The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan
On Mon, 30 Apr 2012 19:14:48 -0700, Mark Sapiro wrote:
cron log was not enabled...
I enabled it and relaunched mailman. I hope next time, the cron log will show something useful.
No. That isn't relevant to this issue. Apparently, something reversed the patch on April 22 which is just as well.
In fact, the patch is still applied. I see the line
self.__conn.set_debuglevel(1)
in
/var/lib/mailman/Mailman/Handlers/SMTPDirect.py
Anyway. It seems to be working again.
I'll try to figure that out.
[...]
Right. Sorry about that. I did that config 3 years ago and I don't remember much... Basically I wanted to mention I was using postfix.
Thank you for your help.
I'll come back here if I have more.
-- Jérôme
Participants (2): Jérôme, Mark Sapiro