Hello all - mailman down after power failure and hard shutdown

Hello,
Ok, we had a power failure, and apparently my UPS thought it had more time left than it did, as the UPS shut down before it shut down the system.
Everything is back up and running, and postfix is running fine for all other mail, except list/mailman mail.
I'm getting the following error when trying to send an email to one of the lists:
2013-06-08T06:30:47-04:00 myhost postfix/postsuper[29691]: Requeued: 1 message
2013-06-08T06:31:12-04:00 myhost postfix/pickup[3124]: D55D7B7D175: uid=207 from=<valid-list@media-brokers.com> orig_id=45BF8B7B393
2013-06-08T06:31:12-04:00 myhost postfix/cleanup[29631]: D55D7B7D175: message-id=<51B30786.7020805@media-brokers.com>
2013-06-08T06:31:12-04:00 myhost postfix/qmgr[3126]: D55D7B7D175: from=<valid-list-bounces@media-brokers.com>, size=4065, nrcpt=6 (queue active)
2013-06-08T06:31:12-04:00 myhost postfix/qmgr[3126]: warning: connect to transport private/local: Resource temporarily unavailable
2013-06-08T06:31:12-04:00 myhost postfix/qmgr[3126]: warning: connect to transport private/retry: Resource temporarily unavailable
I've run check_perms and it says 'No problems found'...
Anyone have any suggestions?
Thanks,
charles

On 6/8/2013 6:43 AM, Tanstaafl wrote:
Is mailman possibly not running? Try this: ps -A | grep mailmanctl
If that gives blank output, try this: /usr/lib/mailman/bin/mailmanctl start
(This was the solution for me when I had a similar problem a month and a half ago. I would like to know where to plug this in so it happens automatically on reboot. That should be an elementary question but I'm still not familiar with all these sysadmin tasks.)
-- Larry Kuenning larry@qhpress.org
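The check-then-start step described above could be scripted roughly as follows. This is only a minimal sketch, not something posted in the thread: the helper name `has_process` is made up for illustration, and the mailmanctl path is the one quoted above, which may well differ on other installs. The helper reads `ps -A` output on stdin so it can be tried against canned text first.

```shell
#!/bin/sh
# has_process NAME: read `ps -A` style output on stdin and succeed
# if the last field (the command name) of any line equals NAME.
# Hypothetical helper, for illustration only.
has_process() {
    awk -v name="$1" '$NF == name { found = 1 } END { exit !found }'
}

# Typical use (assumes the mailmanctl path quoted in the thread):
#   ps -A | has_process mailmanctl || /usr/lib/mailman/bin/mailmanctl start
```

Matching on the whole last field avoids the usual `grep` pitfall of matching its own command line or a longer process name.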

On 2013-06-08 8:10 AM, Larry Kuenning <larry@qhpress.org> wrote:
Not blank - but what does the question mark mean?
# ps -A | grep mailmanctl
2600 ? 00:00:00 mailmanctl
I've tried restarting mailman (appears to work), and even tried rebooting...
Thanks for the assist - any other ideas?
Note: I think this is related to the three postfix errors I posted regarding a problem with the local transport - but I've googled and can't find a solution for that either...
I only posted two of these here:
2013-06-08T06:31:12-04:00 myhost postfix/qmgr[3126]: warning: connect to transport private/local: Resource temporarily unavailable
2013-06-08T06:31:12-04:00 myhost postfix/qmgr[3126]: warning: connect to transport private/retry: Resource temporarily unavailable
The third, which I don't see every time, is:
postfix/master[29913]: warning: master_wakeup_timer_event: service tlsmgr(private/tlsmgr): Resource temporarily unavailable
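To see which Postfix services these warnings name, and how often, something like the following could be run over the mail log. It is a rough sketch under assumptions: `count_unavailable` is a made-up helper name, the log location varies by system, and `grep -o` is a GNU/BSD extension rather than strict POSIX.

```shell
#!/bin/sh
# count_unavailable: read Postfix log lines on stdin and print, for each
# private/<service> named in a "Resource temporarily unavailable" warning,
# the service followed by the number of such warnings.
# Hypothetical helper, for illustration only.
count_unavailable() {
    grep 'Resource temporarily unavailable' |
        grep -o 'private/[a-z]*' |
        sort | uniq -c | awk '{ print $2, $1 }'
}

# Typical use (the log path is an assumption; adjust for your syslog setup):
#   count_unavailable < /var/log/mail.log
```

A per-service count makes it easier to tell whether only local and retry are affected or whether other services such as tlsmgr show up as well.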

On 6/8/2013 8:52 AM, Tanstaafl wrote:
Leaving out the "grep" to get the header ("ps -C mailmanctl" would have been better to start with) I see that that column is headed "TTY". I guess the question mark means the process is not tied to a terminal and so will continue running even if all users log out. Which is the behavior you want, so the problem must be elsewhere.
Thanks for the assist - any other ideas?
Now you need help from somebody who actually knows how Mailman works.
-- Larry Kuenning larry@qhpress.org

On 06/08/2013 08:33 AM, Larry Kuenning wrote:
And 'ps -fwC python' or 'ps -fwu mailman' will show the qrunners too, but all this is moot as it is extremely unlikely that Postfix errors have anything to do with whether or not Mailman is actually running.
-- Mark Sapiro <mark@msapiro.net>
San Francisco Bay Area, California
The highway is for gamblers, better use your sense - B. Dylan

I'd suggest trying "ps -auxww|grep mailman" to see if any mailman processes are running; this assumes mailman runs as its own user id. Some installs use the username list or lists instead of mailman.
If nothing shows up then I'd check:
/etc/postfix/*.cf and /etc/postfix/transport and diff them with an older copy to make sure they haven't changed.
check /var/lib/mailman/qfiles/maildir The actual location may depend on your version and installation options. If mailman is NOT running then the cur subdir should be empty. I've found mailman will not restart if there is anything in the directory cur. I'd check the files, if any, in both new and cur and tmp just to see what's there.
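The "diff them with an older copy" step might look something like this. It is a sketch only: `config_changed` is a hypothetical helper, and where your older copy lives depends entirely on your own backups.

```shell
#!/bin/sh
# config_changed OLD_DIR NEW_DIR: print a recursive unified diff of the two
# directories and succeed (exit 0) only if differences were found.
# diff exits 0 for "identical", 1 for "different" and 2 for "trouble",
# so only status 1 counts as a change.
# Hypothetical helper, for illustration only.
config_changed() {
    status=0
    diff -ru "$1" "$2" || status=$?
    [ "$status" -eq 1 ]
}

# Typical use (the backup path is a placeholder, not a real location):
#   config_changed /path/to/backup/etc/postfix /etc/postfix && echo "config differs"
```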

Thanks for trying, but mailman is running fine.
Lists that have only real email addresses work fine.
Also, individual messages invoking postfix/local also work fine, (ie, emails sent from cron (8 from last night and this morning), etc)...
Mark has helped me narrow the problem down to whenever multiple messages are submitted to postfix/local simultaneously.
On 2013-06-08 2:20 PM, Richard Shetron <guest2@sgeinc.com> wrote:

On 06/08/2013 05:10 AM, Larry Kuenning wrote:
The GNU Mailman tarball distribution contains misc/mailman.in which configure uses to make misc/mailman.
This is a sample init.d script for Mailman and it contains instructions for installing and activating it on RedHat/CentOS and Debian/Ubuntu.
And if you installed Mailman from a package, your packager should have provided this or something similar.
-- Mark Sapiro <mark@msapiro.net>
San Francisco Bay Area, California
The highway is for gamblers, better use your sense - B. Dylan

On 06/08/2013 03:43 AM, Tanstaafl wrote:
Postfix has received the message and is trying to deliver it via the local transport which is good.
2013-06-08T06:31:12-04:00 myhost postfix/qmgr[3126]: warning: connect to transport private/local: Resource temporarily unavailable
but Postfix can't find the local transport or more likely there is a stale lock on the transport left over from before the crash, so Postfix tries to queue the message for retry.
2013-06-08T06:31:12-04:00 myhost postfix/qmgr[3126]: warning: connect to transport private/retry: Resource temporarily unavailable
but it can't access the retry transport either ...
I've run check_perms and it says 'No problems found'...
Because this isn't a Mailman problem. It's a Postfix problem. I don't know enough Postfix to point directly at a solution, but I doubt that Postfix can deliver any mail via the local transport.
-- Mark Sapiro <mark@msapiro.net>
San Francisco Bay Area, California
The highway is for gamblers, better use your sense - B. Dylan

On 06/08/2013 10:20 AM, Mark Sapiro wrote:
Actually, private/local and private/retry refer to the sockets used for communication between the Postfix master and the various daemons. If you do 'netstat -l' you should see these and many others 'LISTENING'. Do you?
I don't know why a reboot or even just a stop and start of Postfix doesn't fix this. If you stop and start Postfix, are there any messages in the mail logs beyond the "postfix/master[pppp]: daemon started ..." message?
-- Mark Sapiro <mark@msapiro.net>
San Francisco Bay Area, California
The highway is for gamblers, better use your sense - B. Dylan
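The 'netstat -l' check suggested above can be narrowed to just the sockets of interest. The following is a rough sketch, assuming the usual Linux net-tools layout in which a listening Unix socket line contains LISTENING and ends with the socket path; `missing_sockets` is a made-up helper name.

```shell
#!/bin/sh
# missing_sockets NAME...: read `netstat -l` style output on stdin and print
# each NAME for which no LISTENING line ends in that path.
# Prints nothing when every named socket is listening.
# Hypothetical helper, for illustration only.
missing_sockets() {
    awk -v wanted="$*" '
        BEGIN { n = split(wanted, names, " ") }
        /LISTENING/ { seen[$NF] = 1 }
        END {
            for (i = 1; i <= n; i++)
                if (!(names[i] in seen)) print names[i]
        }'
}

# Typical use:
#   netstat -l | missing_sockets private/local private/retry private/tlsmgr
```

An empty result only shows that the sockets exist and are listening; as the rest of the thread shows, connections to them can still fail with "Resource temporarily unavailable".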

On 2013-06-08 1:58 PM, Mark Sapiro <mark@msapiro.net> wrote:
Yep, they're all there. And local is working - at least sometimes (see below) :(
Nothing more than the three warnings I already posted, two of which you see below, and the third being:
2013-06-08T13:10:19-04:00 myhost postfix/master[4076]: warning: master_wakeup_timer_event: service tlsmgr(private/tlsmgr): Resource temporarily unavailable
But, I have more details after some testing...
First, mailman is definitely working. I tested sending to one of my test lists with just two people on it, and it works fine:
I tested with another list that has 6 people on it, two of whom have their vacation message enabled (I use postfixadmin vacation), and while all 6 recipients got the message, there were two messages that got stuck in the queue that are related to the vacation message:
As you can see, only the two vacation messages are deferred with transport unavailable.
It also appears that the problem manifests with NESTED lists:
I imagine that the two problems are being caused by the same problem, whatever it is...
It also seems to be something to do with how many recipients are involved. One or two appear to be ok, but more than that and it gets iffy...
Appreciate any more thoughts on this weirdness, because I'm stumped....

On 06/08/2013 02:21 PM, Tanstaafl wrote:
I think that's a coincidence. The biggest problem is with delivery from Postfix to Mailman, at which point nothing knows how many list members there are or how many messages Mailman will send.
Appreciate any more thoughts on this weirdness, because I'm stumped....
See this <http://tech.groups.yahoo.com/group/postfix-users/message/245375>, particularly the replies from Wietse.
-- Mark Sapiro <mark@msapiro.net>
San Francisco Bay Area, California
The highway is for gamblers, better use your sense - B. Dylan

On 2013-06-08 5:44 PM, Mark Sapiro <mark@msapiro.net> wrote:
I read them all, and I don't think it is relevant (this is the same kernel and same versions of postfix dovecot and mailman for some time now), but, I changed the default limit to 10 and reloaded postfix, with the same error when sending to my 'All' list (that has only 6 members, all lists).
Also, as I said, lists that only have individual recipients work just fine, even with 30+ recipients.
Also the weirdness when a list member has their vacation enabled - they get the original list message, but the vacation message gets stuck in the queue with the error.
I'm thinking of reinstalling (this is gentoo, so that will be easy) first mailman, then postfix... I'll probably try that tomorrow if no other solution presents itself.
Thanks for your help, Mark, much appreciated...

On 06/08/2013 03:10 PM, Tanstaafl wrote:
How long was the system up before the crash, and during that time did you change any dynamic configuration parameters that would have been reverted by the crash?
Also, as I said, lists that only have individual recipients work just fine, even with 30+ recipients.
So, the problem doesn't occur with a single message to a single list, but it occurs when Postfix receives six messages at once FROM the lists-all list. Also your deliveries to to=<validuser1@media-brokers.com> et al are via the virtual transport which is apparently unaffected.
Another case of multiple messages to be handled by the local transport.
If you reinstall Mailman without touching Postfix and that fixes this, I'll be incredibly surprised.
All the evidence you've presented together with everything I know says this is a Postfix issue, not a Mailman issue. If I knew Postfix as well as I know Mailman, I could probably tell you how to fix this.
-- Mark Sapiro <mark@msapiro.net>
San Francisco Bay Area, California
The highway is for gamblers, better use your sense - B. Dylan

On 2013-06-08 7:13 PM, Mark Sapiro <mark@msapiro.net> wrote:
Hmm, so when a list only contains other lists as members, those will use postfix's local transport, but when the members are individuals (for final delivery), it uses virtual. Ok, that makes sense then.
Another case of multiple messages to be handled by the local transport.
Ok, yeah, I think you've nailed it... the problem is when more than one message at a time is passed to postfix/local...
If you reinstall Mailman without touching Postfix and that fixes this, I'll be incredibly surprised.
I think you're right, I'll do postfix first.
Wish I did... I did get a comment from Victor on the postfix list to check all of my aliases, so I ran newaliases but that didn't help. Is there anything else I can do to test the mailman aliases? Since the individual lists work - confirmed because I sent the mass email I've been trying to send since this happened to each individual list that is a member of the lists-all list, and those all worked fine.
I agree with you that this seems to be a postfix problem, but is it possible that some kind of corruption in a userdb could cause these warnings? To recap, they are:
The first one from postfix/master only shows up rarely - 11 times since I got the system back up, and within 5 or 10 minutes (but usually within 5 or 10 seconds) of postfix being restarted:
postfix/master[6406]: warning: master_wakeup_timer_event: service tlsmgr(private/tlsmgr): Resource temporarily unavailable
Then these (only when I try to send to my lists-all list):
warning: connect to transport private/local: Resource temporarily unavailable
warning: connect to transport private/retry: Resource temporarily unavailable
I do have backups of my mysql userdb, as well as all others (mailman aliases/dbs, etc), so I can replace any of these from backups if it will fix the problem.
Thanks again for your time and help Mark...
Charles

On 06/09/2013 05:49 AM, Tanstaafl wrote:
I thought about aliases, but aliases are only consulted by the local transport, and the issue is in passing the message to the local transport (and also the retry transport and the vacation transport). Thus, I don't think aliases could be involved.
However, if aliases were involved, the thing to run is Mailman's bin/genaliases, but we know aliases are not the problem, both from the above and the fact that the lists all work 'one at a time'
There is definitely some resource contention issue when Postfix is trying to access the same socket for multiple messages.
-- Mark Sapiro <mark@msapiro.net>
San Francisco Bay Area, California
The highway is for gamblers, better use your sense - B. Dylan

Ok, facepalm time...
I had forgotten that I had built a new kernel a few weeks ago, and changed it to the default - but hadn't properly tested it yet.
Reverting to the previous kernel resolved the problem.
I'm not sure what the heck I changed to cause this, but that'll sure teach me to never change the kernel boot default without proper testing.
Anyway, thanks for the assist and sorry for the noise.
Charles
On 2013-06-09 10:33 AM, Mark Sapiro <mark@msapiro.net> wrote:

participants (4): Larry Kuenning, Mark Sapiro, Richard Shetron, Tanstaafl