Re: [Mailman-Users] {Disarmed} Re: unable to access admin interface

On 07/22/2014 01:56 PM, Chad Rebuck wrote:
There is definitely a locking issue. It's not that there are too many per se, it's that the list is locked by a defunct process and the lock will never be relinquished.
If you
cat /var/lib/mailman/locks/a2-16v-list.lock
the contents will be the file name of the active lock. Judging from the link count, it's this one
-rw-rw-r-- 2 http mailman 52 Jul 22 2014 a2-16v-list.lock.pogo.20307.0
do 'ps -fww 20307' or whatever the PID of the active lock is. If that PID doesn't exist, do
rm /var/lib/mailman/locks/a2-16v-list.lock.pogo.19620.0
and wait a few seconds. If all the locks don't clear, repeat until they do.
If a PID with the lock does exist, ps -fww will show what it is which may help identfy why it is hung with the list locked.
-- Mark Sapiro <mark@msapiro.net> The highway is for gamblers, San Francisco Bay Area, California better use your sense - B. Dylan

Should I delete each lock in order from oldest to newest, excluding the active lock (20307)? I've deleted a few and waited a few minutes between each delete and saw no locks auto disappear yet.
Would I delete a2-16v-list.lock is I make it down the list that far should I leave it alone?
[root@pogo locks]# ls -lsrt total 92 4 -rw-rw-r-- 1 http mailman 52 Jul 20 22:38 a2-16v-list.lock.pogo.20648.0 4 -rw-rw-r-- 1 http mailman 52 Jul 20 22:40 a2-16v-list.lock.pogo.20663.0 4 -rw-rw-r-- 1 http mailman 52 Jul 20 22:44 a2-16v-list.lock.pogo.20674.0 4 -rw-rw-r-- 1 http mailman 52 Jul 20 22:48 a2-16v-list.lock.pogo.20699.0 4 -rw-rw-r-- 1 http mailman 52 Jul 20 22:57 a2-16v-list.lock.pogo.20827.0 4 -rw-rw-r-- 1 http mailman 52 Jul 20 23:16 a2-16v-list.lock.pogo.20980.0 4 -rw-rw-r-- 1 http mailman 52 Jul 20 23:19 a2-16v-list.lock.pogo.20989.0 4 -rw-rw-r-- 1 http mailman 52 Jul 20 23:54 a2-16v-list.lock.pogo.21944.0 4 -rw-rw-r-- 1 http mailman 50 Jul 21 00:16 a2-16v-list.lock.pogo.406.0 4 -rw-rw-r-- 1 http mailman 50 Jul 21 00:20 a2-16v-list.lock.pogo.435.0 4 -rw-rw-r-- 1 http mailman 50 Jul 21 00:31 a2-16v-list.lock.pogo.484.0 4 -rw-rw-r-- 2 http mailman 52 Jul 22 2014 a2-16v-list.lock.pogo.20307.0 4 -rw-rw-r-- 2 http mailman 52 Jul 22 2014 a2-16v-list.lock 4 -rw-rw-r-- 1 http mailman 52 Jul 22 2014 a2-16v-list.lock.pogo.20311.0 4 -rw-rw-r-- 1 http mailman 52 Jul 22 2014 a2-16v-list.lock.pogo.20312.0 4 -rw-rw-r-- 1 http mailman 52 Jul 22 2014 a2-16v-list.lock.pogo.20313.0 4 -rw-rw-r-- 1 http mailman 52 Jul 22 2014 a2-16v-list.lock.pogo.20316.0 4 -rw-rw-r-- 1 http mailman 52 Jul 22 2014 a2-16v-list.lock.pogo.20323.0 4 -rw-rw-r-- 1 http mailman 52 Jul 22 2014 a2-16v-list.lock.pogo.20351.0 4 -rw-rw-r-- 1 http mailman 52 Jul 22 2014 a2-16v-list.lock.pogo.20519.0 4 -rw-rw-r-- 1 http mailman 52 Jul 22 2014 a2-16v-list.lock.pogo.20598.0 4 -rw-rw-r-- 2 mailman mailman 46 Jul 23 2014 master-qrunner.pogo.341 4 -rw-rw-r-- 2 mailman mailman 46 Jul 23 2014 master-qrunner [root@pogo locks]#
On Tue, Jul 22, 2014 at 5:11 PM, Mark Sapiro <mark@msapiro.net> wrote:

On 07/22/2014 02:20 PM, Chad Rebuck wrote:
You can go through the whole list looking for non-existent PIDs and delete all those, or you can delete just the active lock as that is the one preventing the others from proceeding.
I.e., delete the active lock (assuming its PID is gone) and some waiting processes may proceed. In any case, a new lock will become active. Repeat that until no locks remain.
Would I delete a2-16v-list.lock is I make it down the list that far should I leave it alone?
If at some point you are left with only the a2-16v-list.lock and no a2-16v-list.lock.pogo.PID.* files, you can just remove a2-16v-list.lock, or I think if you don't, it will be automatically removed the next time a process locks and unlocks the a2-16v-list list.
-- Mark Sapiro <mark@msapiro.net> The highway is for gamblers, San Francisco Bay Area, California better use your sense - B. Dylan

All the lock files have been deleted as I found no process running for any of them. However, the problem still occurs. What should I attempt next?
Thanks.
On Tue, Jul 22, 2014 at 5:29 PM, Mark Sapiro <mark@msapiro.net> wrote:

On 07/22/2014 07:10 PM, Chad Rebuck wrote:
What is the content of the /var/lib/mailman/locks/ directory now?
Did you remove the a2-16v-list.lock file? If not, you probably need to.
Is the message in the web server error log the same as before?
-- Mark Sapiro <mark@msapiro.net> The highway is for gamblers, San Francisco Bay Area, California better use your sense - B. Dylan

I didn't remove it until now, but I saw it seemed to update itself when I tried to access the admin page.
[root@pogo locks]# ll total 16K -rw-rw-r-- 1 http mailman 52 Jul 23 2014 a2-16v-list.lock -rw-rw-r-- 1 http mailman 52 Jul 23 2014 a2-16v-list.lock.pogo.21765.0 -rw-rw-r-- 2 mailman mailman 46 Jul 24 2014 master-qrunner -rw-rw-r-- 2 mailman mailman 46 Jul 24 2014 master-qrunner.pogo.341
[root@pogo locks]# rm -rf a2-16v-list.lock*
[root@pogo locks]# ll total 8.0K -rw-rw-r-- 2 mailman mailman 46 Jul 24 2014 master-qrunner -rw-rw-r-- 2 mailman mailman 46 Jul 24 2014 master-qrunner.pogo.341
I was then able to access the admin page, but when I clicked on "tend to pending requests" I got the 504 error after a couple minutes
[root@pogo locks]# tail -2 /var/log/httpd/error_log [Tue Jul 22 22:22:30.721677 2014] [cgi:error] [pid 378:tid 1143579696] [client 144.160.5.25:34507] Script timed out before returning headers: admindb, referer: http://mail.a2-16v.com/mailman/admin/a2-16v-list [Tue Jul 22 22:23:30.781402 2014] [cgi:warn] [pid 378:tid 1143579696] [client 144.160.5.25:34507] AH01220: Timeout waiting for output from CGI script /usr/lib/mailman/cgi-bin/admindb, referer: http://mail.a2-16v.com/mailman/admin/a2-16v-list [root@pogo locks]# tail -2 /var/log/httpd/access_log 144.160.5.25 - - [22/Jul/2014:22:21:14 -0400] "GET /mailman/admin/a2-16v-list HTTP/1.1" 200 20748 144.160.5.25 - - [22/Jul/2014:22:21:30 -0400] "GET /mailman/admindb/a2-16v-list HTTP/1.1" 504 247
now the lock folder is as follows
[root@pogo locks]# ll total 16K -rw-rw-r-- 2 http mailman 52 Jul 23 2014 a2-16v-list.lock -rw-rw-r-- 2 http mailman 52 Jul 23 2014 a2-16v-list.lock.pogo.21828.0 -rw-rw-r-- 2 mailman mailman 46 Jul 24 2014 master-qrunner -rw-rw-r-- 2 mailman mailman 46 Jul 24 2014 master-qrunner.pogo.341 [root@pogo locks]#
On Tue, Jul 22, 2014 at 10:14 PM, Mark Sapiro <mark@msapiro.net> wrote:

On 07/22/2014 07:25 PM, Chad Rebuck wrote:
So PID 21828 has the list lock. It looks like the admindb CGI, PID 378 timed out, presumably because it couldn't get the lock, but its request would be a file named a2-16v-list.lock.pogo.378.0, and that doesn't seem to be there.
What if anything is in Mailman's 'error' log and 'locks' log?
Locking may be the underlying issue, but it's not just stale locks because even after they're cleared, you still have issues.
How did you install 2.1.18-1?
-- Mark Sapiro <mark@msapiro.net> The highway is for gamblers, San Francisco Bay Area, California better use your sense - B. Dylan

I installed it with a package on my arch system using
pacman -Syu
[root@pogo chadr]# ll /var/lib/mailman/locks/ total 16K -rw-rw-r-- 1 http mailman 52 Jul 23 03:52 a2-16v-list.lock.pogo.22018.0 -rw-rw-r-- 1 http mailman 52 Jul 23 04:45 a2-16v-list.lock.pogo.22205.0 -rw-rw-r-- 2 mailman mailman 46 Jul 24 2014 master-qrunner -rw-rw-r-- 2 mailman mailman 46 Jul 24 2014 master-qrunner.pogo.341
Then I tried to access
http://mail.a2-16v.com/mailman/admindb/a2-16v-list and waited for the 504 error to appear in the browser
[root@pogo chadr]# ll /var/lib/mailman/locks/ total 24K -rw-rw-r-- 2 http mailman 52 Jul 23 2014 a2-16v-list.lock -rw-rw-r-- 1 http mailman 52 Jul 23 03:52 a2-16v-list.lock.pogo.22018.0 -rw-rw-r-- 1 http mailman 52 Jul 23 04:45 a2-16v-list.lock.pogo.22205.0 -rw-rw-r-- 2 http mailman 52 Jul 23 2014 a2-16v-list.lock.pogo.25275.0 -rw-rw-r-- 2 mailman mailman 46 Jul 24 2014 master-qrunner -rw-rw-r-- 2 mailman mailman 46 Jul 24 2014 master-qrunner.pogo.341
No recent updates to the locks or error log
[root@pogo chadr]# tail /var/lib/mailman/logs/locks Jul 23 03:47:28 2014 (21881) File "/usr/lib/mailman/Mailman/Queue/Runner.py", line 190, in _onefile Jul 23 03:47:28 2014 (21881) keepqueued = self._dispose(mlist, msg, msgdata) Jul 23 03:47:28 2014 (21881) File "/usr/lib/mailman/Mailman/Queue/IncomingRunner.py", line 115, in _dispose Jul 23 03:47:28 2014 (21881) mlist.Lock(timeout=mm_cfg.LIST_LOCK_TIMEOUT) Jul 23 03:47:28 2014 (21881) File "/usr/lib/mailman/Mailman/MailList.py", line 161, in Lock Jul 23 03:47:28 2014 (21881) self.__lock.lock(timeout) Jul 23 03:47:28 2014 (21881) File "/usr/lib/mailman/Mailman/LockFile.py", line 306, in lock Jul 23 03:47:28 2014 (21881) important=True) Jul 23 03:47:28 2014 (21881) File "/usr/lib/mailman/Mailman/LockFile.py", line 416, in __writelog Jul 23 03:47:28 2014 (21881) traceback.print_stack(file=logf)
[root@pogo chadr]# tail /var/lib/mailman/logs/error Jun 20 06:32:44 2014 mailmanctl(382): lock = acquire_lock_1(force) Jun 20 06:32:44 2014 mailmanctl(382): File "/usr/lib/mailman/bin/mailmanctl", line 203, in acquire_lock_1 Jun 20 06:32:44 2014 mailmanctl(382): if not force or qrunner_state(): Jun 20 06:32:44 2014 mailmanctl(382): File "/usr/lib/mailman/bin/mailmanctl", line 187, in qrunner_state Jun 20 06:32:44 2014 mailmanctl(382): os.kill(pid, 0) Jun 20 06:32:44 2014 mailmanctl(382): OSError : [Errno 1] Operation not permitted Jun 21 01:24:28 2014 (3941) No such list "a2-16v-list<": Jul 20 17:47:52 2014 mailmanctl(20690): The master qrunner lock could not be acquired because it appears as if another master qrunner is already running. Jul 20 17:47:52 2014 mailmanctl(20690):
[root@pogo chadr]# date Wed Jul 23 09:51:56 EDT 2014
On Wed, Jul 23, 2014 at 12:00 AM, Mark Sapiro <mark@msapiro.net> wrote:

Here is additional information. The log entries for "lifetime has expired" started happening much more frequently after upgrading to 2.1.18-1.
This is the most recent complete log entry for the "lifetime has expired":
Jul 23 03:47:28 2014 (21881) a2-16v-list.lock lifetime has expired, breaking Jul 23 03:47:28 2014 (21881) File "/usr/lib/mailman/bin/qrunner", line 278, in <module> Jul 23 03:47:28 2014 (21881) main() Jul 23 03:47:28 2014 (21881) File "/usr/lib/mailman/bin/qrunner", line 238, in main Jul 23 03:47:28 2014 (21881) qrunner.run() Jul 23 03:47:28 2014 (21881) File "/usr/lib/mailman/Mailman/Queue/Runner.py", line 70, in run Jul 23 03:47:28 2014 (21881) filecnt = self._oneloop() Jul 23 03:47:28 2014 (21881) File "/usr/lib/mailman/Mailman/Queue/Runner.py", line 119, in _oneloop Jul 23 03:47:28 2014 (21881) self._onefile(msg, msgdata) Jul 23 03:47:28 2014 (21881) File "/usr/lib/mailman/Mailman/Queue/Runner.py", line 190, in _onefile Jul 23 03:47:28 2014 (21881) keepqueued = self._dispose(mlist, msg, msgdata) Jul 23 03:47:28 2014 (21881) File "/usr/lib/mailman/Mailman/Queue/IncomingRunner.py", line 115, in _dispose Jul 23 03:47:28 2014 (21881) mlist.Lock(timeout=mm_cfg.LIST_LOCK_TIMEOUT) Jul 23 03:47:28 2014 (21881) File "/usr/lib/mailman/Mailman/MailList.py", line 161, in Lock Jul 23 03:47:28 2014 (21881) self.__lock.lock(timeout) Jul 23 03:47:28 2014 (21881) File "/usr/lib/mailman/Mailman/LockFile.py", line 306, in lock Jul 23 03:47:28 2014 (21881) important=True) Jul 23 03:47:28 2014 (21881) File "/usr/lib/mailman/Mailman/LockFile.py", line 416, in __writelog Jul 23 03:47:28 2014 (21881) traceback.print_stack(file=logf)
[root@pogo chadr]# grep lifetime /var/lib/mailman/logs/locks Feb 01 23:05:41 2012 (19141) master-qrunner lifetime has expired, breaking Dec 19 08:48:06 2012 (25356) master-qrunner lifetime has expired, breaking Dec 28 17:54:14 2012 (588) master-qrunner lifetime has expired, breaking Jun 05 03:02:35 2013 (249) master-qrunner lifetime has expired, breaking Sep 09 09:25:40 2013 (235) a2-16v-list.lock lifetime has expired, breaking Dec 14 06:30:53 2013 (251) a2-16v-list.lock lifetime has expired, breaking Feb 05 22:49:50 2014 (247) a2-16v-list.lock lifetime has expired, breaking Mar 02 22:31:50 2014 (189) master-qrunner lifetime has expired, breaking Jun 27 17:05:45 2014 (344) master-qrunner lifetime has expired, breaking
--> this is when mailman was upgraded to 2.1.18-1
Jul 20 18:08:00 2014 (365) a2-16v-list.lock lifetime has expired, breaking Jul 20 23:15:50 2014 (346) a2-16v-list.lock lifetime has expired, breaking Jul 21 13:58:47 2014 (13825) a2-16v-list.lock lifetime has expired, breaking Jul 22 22:01:47 2014 (21735) a2-16v-list.lock lifetime has expired, breaking Jul 23 03:47:28 2014 (21881) a2-16v-list.lock lifetime has expired, breaking
[root@pogo chadr]# grep mailman /var/log/pacman.log [2011-08-04 23:11] Running 'pacman -S mailman' [2011-08-04 23:12] Running 'pacman -S mailman' [2011-08-04 23:15] installed mailman (2.1.14-1) [2012-04-09 18:04] upgraded mailman (2.1.14-1 -> 2.1.14-3) [2012-04-30 15:56] Running 'pacman -S mailman' [2012-04-30 15:57] upgraded mailman (2.1.14-3 -> 2.1.14-3) [2012-04-30 16:01] Running 'pacman -U /var/cache/pacman/pkg/mailman-2.1.14-1-arm.pkg.tar.xz' [2012-04-30 16:02] upgraded mailman (2.1.14-3 -> 2.1.14-1) [2012-04-30 18:03] Running 'pacman -S mailman' [2012-04-30 18:14] Running 'pacman -Syu mailman' [2012-04-30 18:15] Running 'pacman -S mailman' [2012-04-30 18:16] upgraded mailman (2.1.14-1 -> 2.1.14-4) [2012-08-14 09:37] upgraded mailman (2.1.14-4 -> 2.1.15-2) [2013-02-05 03:06] upgraded mailman (2.1.15-2 -> 2.1.15-3) [2014-03-06 12:37] [PACMAN] upgraded mailman (2.1.15-3 -> 2.1.17-1) [2014-07-20 17:37] [PACMAN] Running 'pacman -Syu mailman' [2014-07-20 17:37] [PACMAN] Running 'pacman -S mailman' [2014-07-20 17:38] [PACMAN] upgraded mailman (2.1.17-1 -> 2.1.18.1-1) [root@pogo chadr]#
On Wed, Jul 23, 2014 at 9:54 AM, Chad Rebuck <chadrebuck@gmail.com> wrote:

hi. Nope, didn't touch the configuration after the latest update. Glad Mark led me to the root cause. Working just fine now that I ran the database check.
Great support! On Jul 23, 2014 8:37 PM, "Stephen J. Turnbull" <stephen@xemacs.org> wrote:

On 07/23/2014 06:54 AM, Chad Rebuck wrote:
I installed it with a package on my arch system using
pacman -Syu
Did it ever work?
I don't think the issue is lock contention because it appears that your CGI process was PID 25275 and it acquired the lock. Either that or something is really hosed and the CGI attempts to lock the list twice.
Also, there's something weird about the clock on your system because 4 of the above 6 files have time stamps in the future.
I don't know if there are multiple runner issues or not, and they shouldn't affect CGIs anyway, but see the "You need to stop all Mailman processes and then start Mailman once only." section in the FAQ at <http://wiki.list.org/x/_4A9>. Do this up to the point of starting Mailman, but before starting Mailman remove all files from the /var/lib/mailman/locks/ directory and then try the web interface again without first starting Mailman.
-- Mark Sapiro <mark@msapiro.net> The highway is for gamblers, San Francisco Bay Area, California better use your sense - B. Dylan

It has been months at least (maybe even back to 2013) since I've attempted to access the pending moderator requests, but it has worked before.
I stopped mailman, confirmed no python processes exist, and deleted all files from the locks dir. I could access the admin page, but not the pending moderator requests.
[root@pogo locks]# ll total 8.0K -rw-rw-r-- 2 http mailman 52 Jul 23 2014 a2-16v-list.lock -rw-rw-r-- 2 http mailman 52 Jul 23 2014 a2-16v-list.lock.pogo.25650.0 [root@pogo locks]#
On Wed, Jul 23, 2014 at 11:10 AM, Mark Sapiro <mark@msapiro.net> wrote:

On 07/23/2014 08:39 AM, Chad Rebuck wrote:
First, ignore my prior remark about time stamps in the future. This is expected. It is the way locking determines when a lock expires.
What happens if you do something like the above, but before the CGI times out you look at the locks and then look for the PID?
I.e., if you looked at the locks and saw the above, do
ps -fww 25650
to see some detail about the process that has the lock while it's still there.
It appears that the CGI is not responding for some other reason and when killed by the web server, it leaves the list locked. So locked lists are not the cause of the problem, but another symptom of the problem whatever that might be.
-- Mark Sapiro <mark@msapiro.net> The highway is for gamblers, San Francisco Bay Area, California better use your sense - B. Dylan

Here ya go
[root@pogo locks]# ll total 16K -rw-rw-r-- 2 http mailman 52 Jul 23 2014 a2-16v-list.lock -rw-rw-r-- 2 http mailman 52 Jul 23 2014 a2-16v-list.lock.pogo.25890.0 -rw-rw-r-- 2 mailman mailman 48 Jul 24 2014 master-qrunner -rw-rw-r-- 2 mailman mailman 48 Jul 24 2014 master-qrunner.pogo.25683 [root@pogo locks]# ps -fww 25890 UID PID PPID C STIME TTY STAT TIME CMD http 25890 378 95 12:12 ? R 0:11 /usr/bin/python2 -S /usr/lib/mailman/scripts/driver admindb
On Wed, Jul 23, 2014 at 12:05 PM, Mark Sapiro <mark@msapiro.net> wrote:

On 07/23/2014 09:15 AM, Chad Rebuck wrote:
I have some ideas about debugging, but it's complicated and I won't have time for more detail until late this afternoon, but what happens if you give the shell command
/usr/bin/python2 /usr/lib/mailman/scripts/driver admindb
You should get a "we hit a bug" response and Mailman's error log should have something like
admin(5110): Traceback (most recent call last): admin(5110): File "scripts/driver", line 111, in run_main admin(5110): if not request_method.lower() in ['get', 'post', 'head']: admin(5110): AttributeError: 'NoneType' object has no attribute 'lower'
I expect you'll see this and we need to set some environment variables to get further, but more later ...
-- Mark Sapiro <mark@msapiro.net> The highway is for gamblers, San Francisco Bay Area, California better use your sense - B. Dylan

Just to provide closure here, after a bit more off list communication, it was determined that the list's request.pck was huge (1.3 MB) with probably thousands of held messages and the admindb CGI was being timed out by the web server before it could process the whole file which was leaving the list locked.
The OP has set the list's max_days_to_hold = 30 which should allow cron/checkdbs to clean things up the next time it runs.
-- Mark Sapiro <mark@msapiro.net> The highway is for gamblers, San Francisco Bay Area, California better use your sense - B. Dylan

Should I delete each lock in order from oldest to newest, excluding the active lock (20307)? I've deleted a few and waited a few minutes between each delete and saw no locks auto disappear yet.
Would I delete a2-16v-list.lock is I make it down the list that far should I leave it alone?
[root@pogo locks]# ls -lsrt total 92 4 -rw-rw-r-- 1 http mailman 52 Jul 20 22:38 a2-16v-list.lock.pogo.20648.0 4 -rw-rw-r-- 1 http mailman 52 Jul 20 22:40 a2-16v-list.lock.pogo.20663.0 4 -rw-rw-r-- 1 http mailman 52 Jul 20 22:44 a2-16v-list.lock.pogo.20674.0 4 -rw-rw-r-- 1 http mailman 52 Jul 20 22:48 a2-16v-list.lock.pogo.20699.0 4 -rw-rw-r-- 1 http mailman 52 Jul 20 22:57 a2-16v-list.lock.pogo.20827.0 4 -rw-rw-r-- 1 http mailman 52 Jul 20 23:16 a2-16v-list.lock.pogo.20980.0 4 -rw-rw-r-- 1 http mailman 52 Jul 20 23:19 a2-16v-list.lock.pogo.20989.0 4 -rw-rw-r-- 1 http mailman 52 Jul 20 23:54 a2-16v-list.lock.pogo.21944.0 4 -rw-rw-r-- 1 http mailman 50 Jul 21 00:16 a2-16v-list.lock.pogo.406.0 4 -rw-rw-r-- 1 http mailman 50 Jul 21 00:20 a2-16v-list.lock.pogo.435.0 4 -rw-rw-r-- 1 http mailman 50 Jul 21 00:31 a2-16v-list.lock.pogo.484.0 4 -rw-rw-r-- 2 http mailman 52 Jul 22 2014 a2-16v-list.lock.pogo.20307.0 4 -rw-rw-r-- 2 http mailman 52 Jul 22 2014 a2-16v-list.lock 4 -rw-rw-r-- 1 http mailman 52 Jul 22 2014 a2-16v-list.lock.pogo.20311.0 4 -rw-rw-r-- 1 http mailman 52 Jul 22 2014 a2-16v-list.lock.pogo.20312.0 4 -rw-rw-r-- 1 http mailman 52 Jul 22 2014 a2-16v-list.lock.pogo.20313.0 4 -rw-rw-r-- 1 http mailman 52 Jul 22 2014 a2-16v-list.lock.pogo.20316.0 4 -rw-rw-r-- 1 http mailman 52 Jul 22 2014 a2-16v-list.lock.pogo.20323.0 4 -rw-rw-r-- 1 http mailman 52 Jul 22 2014 a2-16v-list.lock.pogo.20351.0 4 -rw-rw-r-- 1 http mailman 52 Jul 22 2014 a2-16v-list.lock.pogo.20519.0 4 -rw-rw-r-- 1 http mailman 52 Jul 22 2014 a2-16v-list.lock.pogo.20598.0 4 -rw-rw-r-- 2 mailman mailman 46 Jul 23 2014 master-qrunner.pogo.341 4 -rw-rw-r-- 2 mailman mailman 46 Jul 23 2014 master-qrunner [root@pogo locks]#
On Tue, Jul 22, 2014 at 5:11 PM, Mark Sapiro <mark@msapiro.net> wrote:

On 07/22/2014 02:20 PM, Chad Rebuck wrote:
You can go through the whole list looking for non-existent PIDs and delete all those, or you can delete just the active lock as that is the one preventing the others from proceeding.
I.e., delete the active lock (assuming its PID is gone) and some waiting processes may proceed. In any case, a new lock will become active. Repeat that until no locks remain.
Would I delete a2-16v-list.lock is I make it down the list that far should I leave it alone?
If at some point you are left with only the a2-16v-list.lock and no a2-16v-list.lock.pogo.PID.* files, you can just remove a2-16v-list.lock, or I think if you don't, it will be automatically removed the next time a process locks and unlocks the a2-16v-list list.
-- Mark Sapiro <mark@msapiro.net> The highway is for gamblers, San Francisco Bay Area, California better use your sense - B. Dylan

All the lock files have been deleted as I found no process running for any of them. However, the problem still occurs. What should I attempt next?
Thanks.
On Tue, Jul 22, 2014 at 5:29 PM, Mark Sapiro <mark@msapiro.net> wrote:

On 07/22/2014 07:10 PM, Chad Rebuck wrote:
What is the content of the /var/lib/mailman/locks/ directory now?
Did you remove the a2-16v-list.lock file? If not, you probably need to.
Is the message in the web server error log the same as before?
-- Mark Sapiro <mark@msapiro.net> The highway is for gamblers, San Francisco Bay Area, California better use your sense - B. Dylan

I didn't remove it until now, but I saw it seemed to update itself when I tried to access the admin page.
[root@pogo locks]# ll total 16K -rw-rw-r-- 1 http mailman 52 Jul 23 2014 a2-16v-list.lock -rw-rw-r-- 1 http mailman 52 Jul 23 2014 a2-16v-list.lock.pogo.21765.0 -rw-rw-r-- 2 mailman mailman 46 Jul 24 2014 master-qrunner -rw-rw-r-- 2 mailman mailman 46 Jul 24 2014 master-qrunner.pogo.341
[root@pogo locks]# rm -rf a2-16v-list.lock*
[root@pogo locks]# ll total 8.0K -rw-rw-r-- 2 mailman mailman 46 Jul 24 2014 master-qrunner -rw-rw-r-- 2 mailman mailman 46 Jul 24 2014 master-qrunner.pogo.341
I was then able to access the admin page, but when I clicked on "tend to pending requests" I got the 504 error after a couple minutes
[root@pogo locks]# tail -2 /var/log/httpd/error_log [Tue Jul 22 22:22:30.721677 2014] [cgi:error] [pid 378:tid 1143579696] [client 144.160.5.25:34507] Script timed out before returning headers: admindb, referer: http://mail.a2-16v.com/mailman/admin/a2-16v-list [Tue Jul 22 22:23:30.781402 2014] [cgi:warn] [pid 378:tid 1143579696] [client 144.160.5.25:34507] AH01220: Timeout waiting for output from CGI script /usr/lib/mailman/cgi-bin/admindb, referer: http://mail.a2-16v.com/mailman/admin/a2-16v-list [root@pogo locks]# tail -2 /var/log/httpd/access_log 144.160.5.25 - - [22/Jul/2014:22:21:14 -0400] "GET /mailman/admin/a2-16v-list HTTP/1.1" 200 20748 144.160.5.25 - - [22/Jul/2014:22:21:30 -0400] "GET /mailman/admindb/a2-16v-list HTTP/1.1" 504 247
now the lock folder is as follows
[root@pogo locks]# ll total 16K -rw-rw-r-- 2 http mailman 52 Jul 23 2014 a2-16v-list.lock -rw-rw-r-- 2 http mailman 52 Jul 23 2014 a2-16v-list.lock.pogo.21828.0 -rw-rw-r-- 2 mailman mailman 46 Jul 24 2014 master-qrunner -rw-rw-r-- 2 mailman mailman 46 Jul 24 2014 master-qrunner.pogo.341 [root@pogo locks]#
On Tue, Jul 22, 2014 at 10:14 PM, Mark Sapiro <mark@msapiro.net> wrote:

On 07/22/2014 07:25 PM, Chad Rebuck wrote:
So PID 21828 has the list lock. It looks like the admindb CGI, PID 378 timed out, presumably because it couldn't get the lock, but its request would be a file named a2-16v-list.lock.pogo.378.0, and that doesn't seem to be there.
What if anything is in Mailman's 'error' log and 'locks' log?
Locking may be the underlying issue, but it's not just stale locks because even after they're cleared, you still have issues.
How did you install 2.1.18-1?
-- Mark Sapiro <mark@msapiro.net> The highway is for gamblers, San Francisco Bay Area, California better use your sense - B. Dylan

I installed it with a package on my arch system using
pacman -Syu
[root@pogo chadr]# ll /var/lib/mailman/locks/ total 16K -rw-rw-r-- 1 http mailman 52 Jul 23 03:52 a2-16v-list.lock.pogo.22018.0 -rw-rw-r-- 1 http mailman 52 Jul 23 04:45 a2-16v-list.lock.pogo.22205.0 -rw-rw-r-- 2 mailman mailman 46 Jul 24 2014 master-qrunner -rw-rw-r-- 2 mailman mailman 46 Jul 24 2014 master-qrunner.pogo.341
Then I tried to access
http://mail.a2-16v.com/mailman/admindb/a2-16v-list and waited for the 504 error to appear in the browser
[root@pogo chadr]# ll /var/lib/mailman/locks/ total 24K -rw-rw-r-- 2 http mailman 52 Jul 23 2014 a2-16v-list.lock -rw-rw-r-- 1 http mailman 52 Jul 23 03:52 a2-16v-list.lock.pogo.22018.0 -rw-rw-r-- 1 http mailman 52 Jul 23 04:45 a2-16v-list.lock.pogo.22205.0 -rw-rw-r-- 2 http mailman 52 Jul 23 2014 a2-16v-list.lock.pogo.25275.0 -rw-rw-r-- 2 mailman mailman 46 Jul 24 2014 master-qrunner -rw-rw-r-- 2 mailman mailman 46 Jul 24 2014 master-qrunner.pogo.341
No recent updates to the locks or error log
[root@pogo chadr]# tail /var/lib/mailman/logs/locks Jul 23 03:47:28 2014 (21881) File "/usr/lib/mailman/Mailman/Queue/Runner.py", line 190, in _onefile Jul 23 03:47:28 2014 (21881) keepqueued = self._dispose(mlist, msg, msgdata) Jul 23 03:47:28 2014 (21881) File "/usr/lib/mailman/Mailman/Queue/IncomingRunner.py", line 115, in _dispose Jul 23 03:47:28 2014 (21881) mlist.Lock(timeout=mm_cfg.LIST_LOCK_TIMEOUT) Jul 23 03:47:28 2014 (21881) File "/usr/lib/mailman/Mailman/MailList.py", line 161, in Lock Jul 23 03:47:28 2014 (21881) self.__lock.lock(timeout) Jul 23 03:47:28 2014 (21881) File "/usr/lib/mailman/Mailman/LockFile.py", line 306, in lock Jul 23 03:47:28 2014 (21881) important=True) Jul 23 03:47:28 2014 (21881) File "/usr/lib/mailman/Mailman/LockFile.py", line 416, in __writelog Jul 23 03:47:28 2014 (21881) traceback.print_stack(file=logf)
[root@pogo chadr]# tail /var/lib/mailman/logs/error Jun 20 06:32:44 2014 mailmanctl(382): lock = acquire_lock_1(force) Jun 20 06:32:44 2014 mailmanctl(382): File "/usr/lib/mailman/bin/mailmanctl", line 203, in acquire_lock_1 Jun 20 06:32:44 2014 mailmanctl(382): if not force or qrunner_state(): Jun 20 06:32:44 2014 mailmanctl(382): File "/usr/lib/mailman/bin/mailmanctl", line 187, in qrunner_state Jun 20 06:32:44 2014 mailmanctl(382): os.kill(pid, 0) Jun 20 06:32:44 2014 mailmanctl(382): OSError : [Errno 1] Operation not permitted Jun 21 01:24:28 2014 (3941) No such list "a2-16v-list<": Jul 20 17:47:52 2014 mailmanctl(20690): The master qrunner lock could not be acquired because it appears as if another master qrunner is already running. Jul 20 17:47:52 2014 mailmanctl(20690):
[root@pogo chadr]# date Wed Jul 23 09:51:56 EDT 2014
On Wed, Jul 23, 2014 at 12:00 AM, Mark Sapiro <mark@msapiro.net> wrote:

Here is additional information. The log entries for "lifetime has expired" started happening much more frequently after upgrading to 2.1.18-1.
This is the most recent complete log entry for the "lifetime has expired":
Jul 23 03:47:28 2014 (21881) a2-16v-list.lock lifetime has expired, breaking Jul 23 03:47:28 2014 (21881) File "/usr/lib/mailman/bin/qrunner", line 278, in <module> Jul 23 03:47:28 2014 (21881) main() Jul 23 03:47:28 2014 (21881) File "/usr/lib/mailman/bin/qrunner", line 238, in main Jul 23 03:47:28 2014 (21881) qrunner.run() Jul 23 03:47:28 2014 (21881) File "/usr/lib/mailman/Mailman/Queue/Runner.py", line 70, in run Jul 23 03:47:28 2014 (21881) filecnt = self._oneloop() Jul 23 03:47:28 2014 (21881) File "/usr/lib/mailman/Mailman/Queue/Runner.py", line 119, in _oneloop Jul 23 03:47:28 2014 (21881) self._onefile(msg, msgdata) Jul 23 03:47:28 2014 (21881) File "/usr/lib/mailman/Mailman/Queue/Runner.py", line 190, in _onefile Jul 23 03:47:28 2014 (21881) keepqueued = self._dispose(mlist, msg, msgdata) Jul 23 03:47:28 2014 (21881) File "/usr/lib/mailman/Mailman/Queue/IncomingRunner.py", line 115, in _dispose Jul 23 03:47:28 2014 (21881) mlist.Lock(timeout=mm_cfg.LIST_LOCK_TIMEOUT) Jul 23 03:47:28 2014 (21881) File "/usr/lib/mailman/Mailman/MailList.py", line 161, in Lock Jul 23 03:47:28 2014 (21881) self.__lock.lock(timeout) Jul 23 03:47:28 2014 (21881) File "/usr/lib/mailman/Mailman/LockFile.py", line 306, in lock Jul 23 03:47:28 2014 (21881) important=True) Jul 23 03:47:28 2014 (21881) File "/usr/lib/mailman/Mailman/LockFile.py", line 416, in __writelog Jul 23 03:47:28 2014 (21881) traceback.print_stack(file=logf)
[root@pogo chadr]# grep lifetime /var/lib/mailman/logs/locks Feb 01 23:05:41 2012 (19141) master-qrunner lifetime has expired, breaking Dec 19 08:48:06 2012 (25356) master-qrunner lifetime has expired, breaking Dec 28 17:54:14 2012 (588) master-qrunner lifetime has expired, breaking Jun 05 03:02:35 2013 (249) master-qrunner lifetime has expired, breaking Sep 09 09:25:40 2013 (235) a2-16v-list.lock lifetime has expired, breaking Dec 14 06:30:53 2013 (251) a2-16v-list.lock lifetime has expired, breaking Feb 05 22:49:50 2014 (247) a2-16v-list.lock lifetime has expired, breaking Mar 02 22:31:50 2014 (189) master-qrunner lifetime has expired, breaking Jun 27 17:05:45 2014 (344) master-qrunner lifetime has expired, breaking
--> this is when mailman was upgraded to 2.1.18-1
Jul 20 18:08:00 2014 (365) a2-16v-list.lock lifetime has expired, breaking Jul 20 23:15:50 2014 (346) a2-16v-list.lock lifetime has expired, breaking Jul 21 13:58:47 2014 (13825) a2-16v-list.lock lifetime has expired, breaking Jul 22 22:01:47 2014 (21735) a2-16v-list.lock lifetime has expired, breaking Jul 23 03:47:28 2014 (21881) a2-16v-list.lock lifetime has expired, breaking
[root@pogo chadr]# grep mailman /var/log/pacman.log [2011-08-04 23:11] Running 'pacman -S mailman' [2011-08-04 23:12] Running 'pacman -S mailman' [2011-08-04 23:15] installed mailman (2.1.14-1) [2012-04-09 18:04] upgraded mailman (2.1.14-1 -> 2.1.14-3) [2012-04-30 15:56] Running 'pacman -S mailman' [2012-04-30 15:57] upgraded mailman (2.1.14-3 -> 2.1.14-3) [2012-04-30 16:01] Running 'pacman -U /var/cache/pacman/pkg/mailman-2.1.14-1-arm.pkg.tar.xz' [2012-04-30 16:02] upgraded mailman (2.1.14-3 -> 2.1.14-1) [2012-04-30 18:03] Running 'pacman -S mailman' [2012-04-30 18:14] Running 'pacman -Syu mailman' [2012-04-30 18:15] Running 'pacman -S mailman' [2012-04-30 18:16] upgraded mailman (2.1.14-1 -> 2.1.14-4) [2012-08-14 09:37] upgraded mailman (2.1.14-4 -> 2.1.15-2) [2013-02-05 03:06] upgraded mailman (2.1.15-2 -> 2.1.15-3) [2014-03-06 12:37] [PACMAN] upgraded mailman (2.1.15-3 -> 2.1.17-1) [2014-07-20 17:37] [PACMAN] Running 'pacman -Syu mailman' [2014-07-20 17:37] [PACMAN] Running 'pacman -S mailman' [2014-07-20 17:38] [PACMAN] upgraded mailman (2.1.17-1 -> 2.1.18.1-1) [root@pogo chadr]#
On Wed, Jul 23, 2014 at 9:54 AM, Chad Rebuck <chadrebuck@gmail.com> wrote:

hi. Nope, didn't touch the configuration after the latest update. Glad Mark led me to the root cause. Working just fine now that I ran the database check.
Great support! On Jul 23, 2014 8:37 PM, "Stephen J. Turnbull" <stephen@xemacs.org> wrote:

On 07/23/2014 06:54 AM, Chad Rebuck wrote:
I installed it with a package on my arch system using
pacman -Syu
Did it ever work?
I don't think the issue is lock contention because it appears that your CGI process was PID 25275 and it acquired the lock. Either that or something is really hosed and the CGI attempts to lock the list twice.
Also, there's something weird about the clock on your system because 4 of the above 6 files have time stamps in the future.
I don't know if there are multiple runner issues or not, and they shouldn't affect CGIs anyway, but see the "You need to stop all Mailman processes and then start Mailman once only." section in the FAQ at <http://wiki.list.org/x/_4A9>. Do this up to the point of starting Mailman, but before starting Mailman remove all files from the /var/lib/mailman/locks/ directory and then try the web interface again without first starting Mailman.
-- Mark Sapiro <mark@msapiro.net> The highway is for gamblers, San Francisco Bay Area, California better use your sense - B. Dylan

It has been months at least (maybe even back to 2013) since I've attempted to access the pending moderator requests, but it has worked before.
I stopped mailman, confirmed no python processes exist, and deleted all files from the locks dir. I could access the admin page, but not the pending moderator requests.
[root@pogo locks]# ll total 8.0K -rw-rw-r-- 2 http mailman 52 Jul 23 2014 a2-16v-list.lock -rw-rw-r-- 2 http mailman 52 Jul 23 2014 a2-16v-list.lock.pogo.25650.0 [root@pogo locks]#
On Wed, Jul 23, 2014 at 11:10 AM, Mark Sapiro <mark@msapiro.net> wrote:

On 07/23/2014 08:39 AM, Chad Rebuck wrote:
First, ignore my prior remark about time stamps in the future. This is expected. It is the way locking determines when a lock expires.
What happens if you do something like the above, but before the CGI times out you look at the locks and then look for the PID?
I.e., if you looked at the locks and saw the above, do
ps -fww 25650
to see some detail about the process that has the lock while it's still there.
It appears that the CGI is not responding for some other reason and when killed by the web server, it leaves the list locked. So locked lists are not the cause of the problem, but another symptom of the problem whatever that might be.
-- Mark Sapiro <mark@msapiro.net> The highway is for gamblers, San Francisco Bay Area, California better use your sense - B. Dylan

Here ya go
[root@pogo locks]# ll total 16K -rw-rw-r-- 2 http mailman 52 Jul 23 2014 a2-16v-list.lock -rw-rw-r-- 2 http mailman 52 Jul 23 2014 a2-16v-list.lock.pogo.25890.0 -rw-rw-r-- 2 mailman mailman 48 Jul 24 2014 master-qrunner -rw-rw-r-- 2 mailman mailman 48 Jul 24 2014 master-qrunner.pogo.25683 [root@pogo locks]# ps -fww 25890 UID PID PPID C STIME TTY STAT TIME CMD http 25890 378 95 12:12 ? R 0:11 /usr/bin/python2 -S /usr/lib/mailman/scripts/driver admindb
On Wed, Jul 23, 2014 at 12:05 PM, Mark Sapiro <mark@msapiro.net> wrote:

On 07/23/2014 09:15 AM, Chad Rebuck wrote:
I have some ideas about debugging, but it's complicated and I won't have time for more detail until late this afternoon, but what happens if you give the shell command
/usr/bin/python2 /usr/lib/mailman/scripts/driver admindb
You should get a "we hit a bug" response and Mailman's error log should have something like
admin(5110): Traceback (most recent call last): admin(5110): File "scripts/driver", line 111, in run_main admin(5110): if not request_method.lower() in ['get', 'post', 'head']: admin(5110): AttributeError: 'NoneType' object has no attribute 'lower'
I expect you'll see this and we need to set some environment variables to get further, but more later ...
-- Mark Sapiro <mark@msapiro.net> The highway is for gamblers, San Francisco Bay Area, California better use your sense - B. Dylan

Just to provide closure here, after a bit more off list communication, it was determined that the list's request.pck was huge (1.3 MB) with probably thousands of held messages and the admindb CGI was being timed out by the web server before it could process the whole file which was leaving the list locked.
The OP has set the list's max_days_to_hold = 30 which should allow cron/checkdbs to clean things up the next time it runs.
-- Mark Sapiro <mark@msapiro.net> The highway is for gamblers, San Francisco Bay Area, California better use your sense - B. Dylan
participants (3)
-
Chad Rebuck
-
Mark Sapiro
-
Stephen J. Turnbull