Further to "Need Help with Mailman Mail Delivery"

Not sure this is relevant but I see this in the qrunner log:
Sep 26 01:03:28 2016 (12454) Qrunner VirginRunner reached maximum restart limit of 10, not restarting.
(And a bunch of similar messages.)

On 09/26/2016 06:27 AM, Chuck Weinstock wrote:
It is absolutely relevant, but it contradicts your prior "All of the qrunners etc. are running." statement.
It says that VirginRunner encountered a fatal error, died and was restarted 10 times and the master (mailmanctl) has given up on it.
What are the sig and sts values from messages in the qrunner log like
Master qrunner detected subprocess exit (pid: 5651, sig: None, sts: 15, class: RetryRunner, slice: 1/1)
and what's in Mailman's error log from the same times that the qrunners are dying?
-- Mark Sapiro <mark@msapiro.net> The highway is for gamblers, San Francisco Bay Area, California better use your sense - B. Dylan
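
For anyone collecting the sig and sts values asked for above, the fields of such a log line can be pulled apart mechanically. This is a minimal Python sketch and not part of Mailman; the regular expression is inferred only from the sample line quoted in this thread, and the names `EXIT_RE` and `parse_exit` are hypothetical.

```python
import re

# Pattern inferred from the sample log line in this thread (an assumption,
# not taken from Mailman's source).
EXIT_RE = re.compile(
    r"Master qrunner detected subprocess exit \("
    r"pid: (?P<pid>\d+), sig: (?P<sig>\w+), sts: (?P<sts>\w+), "
    r"class: (?P<runner>\w+), slice: (?P<slice>\d+/\d+)\)"
)


def _num(text):
    """Turn the logged value into an int, or None when the log says 'None'."""
    return None if text == "None" else int(text)


def parse_exit(line):
    """Return the fields of a 'subprocess exit' line, or None if no match."""
    m = EXIT_RE.search(line)
    if m is None:
        return None
    return {
        "pid": int(m.group("pid")),
        "sig": _num(m.group("sig")),
        "sts": _num(m.group("sts")),
        "runner": m.group("runner"),
        "slice": m.group("slice"),
    }


sample = ("Master qrunner detected subprocess exit "
          "(pid: 5651, sig: None, sts: 15, class: RetryRunner, slice: 1/1)")
print(parse_exit(sample))
```

A `sig` value means the runner was terminated by that signal; an `sts` value means it exited on its own with that status.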

When I looked originally all of the qrunners were running.
I believe I have solved my problem by nuking the old install of Mailman and reinstalling from scratch (and reusing the config.pck, etc. files).
If you are interested I can still supply the information asked for below, but I won’t waste your time otherwise.
I appreciate the response.
Chuck

On 09/26/2016 06:28 PM, Chuck Weinstock wrote:
> When I looked originally all of the qrunners were running.
Probably because it took a message to trigger the exception and there hadn't at that point been enough messages to hit the retry limit.
> I believe I have solved my problem by nuking the old install of Mailman and reinstalling from scratch (and reusing the config.pck, etc files.)
> If you are interested I can still supply the information asked for below, but I won’t waste your time otherwise.
If you're satisfied that you've solved the issue, I'm happy.
Thanks for offering, but I only wanted that info to help you find a solution.
-- Mark Sapiro <mark@msapiro.net> The highway is for gamblers, San Francisco Bay Area, California better use your sense - B. Dylan

Whoops. The reinstalled Mailman stopped working with the same problem overnight. Two of the eight qrunners crashed. I have 3-4 lists and one of them will not open in the web admin interface. It times out as per the apache log:

[Tue Sep 27 09:45:53.591373 2016] [cgi:warn] [pid 2483] [client 128.237.211.152:49581] AH01220: Timeout waiting for output from CGI script /usr/lib/mailman/cgi-bin/admin, referer: http://www.conjel.co/mailman/admin/fttc
[Tue Sep 27 09:45:53.592426 2016] [cgi:error] [pid 2483] [client 128.237.211.152:49581] Script timed out before returning headers: admin, referer: http://www.conjel.co/mailman/admin/fttc
[Tue Sep 27 09:46:53.639699 2016] [cgi:warn] [pid 2483] [client 128.237.211.152:49581] AH01220: Timeout waiting for output from CGI script /usr/lib/mailman/cgi-bin/admin, referer: http://www.conjel.co/mailman/admin/fttc
[Tue Sep 27 09:46:53.640524 2016] [reqtimeout:info] [pid 2483] [client 128.237.211.152:49581] AH01382: Request body read timeout

Here is the access log from the same time frame:

128.237.211.152 - - [27/Sep/2016:09:44:51 -0400] "GET /mailman/admin/fttc HTTP/1.1" 200 2078
128.237.211.152 - - [27/Sep/2016:09:44:53 -0400] "POST /mailman/admin/fttc HTTP/1.1" 504 247

Here is the qrunner log (from earlier when the two qrunners stopped):

Sep 27 06:09:59 2016 (7136) Master qrunner detected subprocess exit (pid: 1194, sig: 9, sts: None, class: VirginRunner, slice: 1/1) [restarting]
Sep 27 06:09:59 2016 (1439) VirginRunner qrunner started.
Sep 27 06:13:22 2016 (7136) Master qrunner detected subprocess exit (pid: 1246, sig: 9, sts: None, class: IncomingRunner, slice: 1/1) [restarting]
Sep 27 06:13:23 2016 (1564) IncomingRunner qrunner started.
Sep 27 06:15:09 2016 (7136) Master qrunner detected subprocess exit (pid: 1439, sig: 9, sts: None, class: VirginRunner, slice: 1/1) [restarting]
Sep 27 06:15:09 2016 (1679) VirginRunner qrunner started.
Sep 27 06:18:00 2016 (7136) Master qrunner detected subprocess exit (pid: 1564, sig: 9, sts: None, class: IncomingRunner, slice: 1/1) [restarting]
Sep 27 06:18:00 2016 (1786) IncomingRunner qrunner started.
Sep 27 06:20:30 2016 (7136) Master qrunner detected subprocess exit (pid: 1679, sig: 9, sts: None, class: VirginRunner, slice: 1/1) [restarting]
Sep 27 06:20:31 2016 (1917) VirginRunner qrunner started.
Sep 27 06:21:56 2016 (7136) Master qrunner detected subprocess exit (pid: 1786, sig: 9, sts: None, class: IncomingRunner, slice: 1/1) [restarting]
Sep 27 06:21:56 2016 (1980) IncomingRunner qrunner started.
Sep 27 06:24:28 2016 (7136) Master qrunner detected subprocess exit (pid: 1917, sig: 9, sts: None, class: VirginRunner, slice: 1/1) [restarting]
Sep 27 06:24:29 2016 (2048) VirginRunner qrunner started.
Sep 27 06:25:55 2016 (7136) Master qrunner detected subprocess exit (pid: 1980, sig: 9, sts: None, class: IncomingRunner, slice: 1/1) [restarting]
Sep 27 06:25:56 2016 (2160) IncomingRunner qrunner started.
Sep 27 06:28:06 2016 (7136) Master qrunner detected subprocess exit (pid: 2048, sig: 9, sts: None, class: VirginRunner, slice: 1/1) [restarting]
Sep 27 06:28:06 2016 (2223) VirginRunner qrunner started.
Sep 27 06:30:03 2016 (7136) Master qrunner detected subprocess exit (pid: 2160, sig: 9, sts: None, class: IncomingRunner, slice: 1/1) [restarting]
Sep 27 06:30:03 2016 (2317) IncomingRunner qrunner started.
Sep 27 06:32:36 2016 (7136) Master qrunner detected subprocess exit (pid: 2223, sig: 9, sts: None, class: VirginRunner, slice: 1/1) [restarting]
Sep 27 06:32:37 2016 (2443) VirginRunner qrunner started.
Sep 27 06:34:03 2016 (7136) Master qrunner detected subprocess exit (pid: 2317, sig: 9, sts: None, class: IncomingRunner, slice: 1/1) [restarting]
Sep 27 06:34:04 2016 (2494) IncomingRunner qrunner started.
Sep 27 06:36:44 2016 (7136) Master qrunner detected subprocess exit (pid: 2443, sig: 9, sts: None, class: VirginRunner, slice: 1/1) [restarting]
Sep 27 06:36:44 2016 (7136) Qrunner VirginRunner reached maximum restart limit of 10, not restarting.
Sep 27 06:45:04 2016 (7136) Master qrunner detected subprocess exit (pid: 2494, sig: 9, sts: None, class: IncomingRunner, slice: 1/1) [restarting]
Sep 27 06:45:04 2016 (7136) Qrunner IncomingRunner reached maximum restart limit of 10, not restarting.

Finally this is the only error in the Mailman error file since the reinstall last night.

Sep 26 20:59:51 2016 admin(8885): @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
admin(8885): [----- Mailman Version: 2.1.15 -----]
admin(8885): [----- Traceback ------]
admin(8885): Traceback (most recent call last):
admin(8885):   File "/usr/lib/mailman/scripts/driver", line 112, in run_main
admin(8885):     main()
admin(8885):   File "/usr/lib/mailman/Mailman/Cgi/admindb.py", line 198, in main
admin(8885):     mlist.Save()
admin(8885):   File "/usr/lib/mailman/Mailman/MailList.py", line 578, in Save
admin(8885):     self.__save(dict)
admin(8885):   File "/usr/lib/mailman/Mailman/MailList.py", line 555, in __save
admin(8885):     os.link(fname, fname_last)
admin(8885): OSError: [Errno 1] Operation not permitted
admin(8885): [----- Python Information -----]
admin(8885): sys.version = 2.7.5 (default, Sep 15 2016, 22:37:39) [GCC 4.8.5 20150623 (Red Hat 4.8.5-4)]
admin(8885): sys.executable = /usr/bin/python
admin(8885): sys.prefix = /usr
admin(8885): sys.exec_prefix = /usr
admin(8885): sys.path = ['/usr/lib/mailman/pythonlib', '/usr/lib/mailman', '/usr/lib/mailman/scripts', '/usr/lib/mailman', '/usr/lib64/python27.zip', '/usr/lib64/python2.7/', '/usr/lib64/python2.7/plat-linux2', '/usr/lib64/python2.7/lib-tk', '/usr/lib64/python2.7/lib-old', '/usr/lib64/python2.7/lib-dynload', '/usr/lib/python2.7/site-packages']
admin(8885): sys.platform = linux2
admin(8885): [----- Environment Variables -----]
admin(8885): HTTP_REFERER: http://conjel.co/mailman/admindb/dsn
admin(8885): CONTEXT_DOCUMENT_ROOT: /usr/lib/mailman/cgi-bin/
admin(8885): SERVER_SOFTWARE: Apache/2.4.6 (CentOS) OpenSSL/1.0.1e-fips PHP/5.4.16
admin(8885): CONTEXT_PREFIX: /mailman/
admin(8885): SERVER_SIGNATURE:
admin(8885): REQUEST_METHOD: POST
admin(8885): PATH_INFO: /dsn
admin(8885): HTTP_ORIGIN: http://conjel.co
admin(8885): SERVER_PROTOCOL: HTTP/1.1
admin(8885): QUERY_STRING:
admin(8885): CONTENT_LENGTH: 39
admin(8885): HTTP_USER_AGENT: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36
admin(8885): HTTP_CONNECTION: keep-alive
admin(8885): HTTP_COOKIE: mailman+admin=280200000069afc0e957732800000036653332316135386362353838333766613833316665656432653339616530633133366130663062
admin(8885): SERVER_NAME: conjel.co
admin(8885): REMOTE_ADDR: 2601:547:f00:cf2c:8c4a:63df:fcba:58e9
admin(8885): PATH_TRANSLATED: /home/personal/htdocs/dsn
admin(8885): SERVER_PORT: 80
admin(8885): SERVER_ADDR: 2001:4800:7818:103:be76:4eff:fe04:5321
admin(8885): DOCUMENT_ROOT: /home/personal/htdocs
admin(8885): PYTHONPATH: /usr/lib/mailman
admin(8885): SCRIPT_FILENAME: /usr/lib/mailman/cgi-bin/admindb
admin(8885): SERVER_ADMIN: root@localhost
admin(8885): HTTP_HOST: conjel.co
admin(8885): SCRIPT_NAME: /mailman/admindb
admin(8885): HTTP_UPGRADE_INSECURE_REQUESTS: 1
admin(8885): HTTP_CACHE_CONTROL: max-age=0
admin(8885): REQUEST_URI: /mailman/admindb/dsn
admin(8885): HTTP_ACCEPT: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
admin(8885): GATEWAY_INTERFACE: CGI/1.1
admin(8885): REMOTE_PORT: 63197
admin(8885): HTTP_ACCEPT_LANGUAGE: en-US,en;q=0.8
admin(8885): REQUEST_SCHEME: http
admin(8885): CONTENT_TYPE: application/x-www-form-urlencoded
admin(8885): HTTP_ACCEPT_ENCODING: gzip, deflate
admin(8885): UNIQUE_ID: V@nEh3AeyVpBSf2Pn@BbogAAAAI

On 09/27/2016 06:55 AM, Chuck Weinstock wrote:
The CGIs are timing out. This is normally caused by a locked list.
sig: 9 is a SIGKILL. This seems to say that something external is killing the runner. This is likely the same or a similar underlying cause as the CGI timeouts, but is different as the CGIs are independent of the qrunners.
This is a permission or security manager (SELinux, apparmor, ?) issue. First try running Mailman's 'bin/check_perms -f' as root. If that fixes anything, it may help. Also, see <https://wiki.list.org/x/17891756>. Note that Mailman's CGI wrappers must be group mailman and SETGID. In particular, these files must not be on a file system mounted with 'nosuid'. If none of this helps, try disabling SELinux.
The qrunners being SIGKILLed is still a bit mysterious, but that could be related to a permissions or SELinux issue.
-- Mark Sapiro <mark@msapiro.net> The highway is for gamblers, San Francisco Bay Area, California better use your sense - B. Dylan
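
The reading of "sig: 9" as SIGKILL can be checked, and the kills tallied per runner, with a few lines of Python. This is only an illustrative sketch run against three lines copied from the qrunner log in this thread; the pattern and the helper names `signal_name` and `kills_per_runner` are hypothetical, and the signal-number lookup assumes a POSIX system such as the CentOS host discussed here.

```python
import re
import signal
from collections import Counter

# Loose pattern for the fields of interest; inferred from the log excerpt
# in this thread, not from Mailman's source.
FIELDS_RE = re.compile(r"sig: (\w+), sts: (\w+), class: (\w+)")


def signal_name(number):
    """Map a signal number to its symbolic name on this platform."""
    return signal.Signals(number).name


def kills_per_runner(lines):
    """Count, per (runner class, signal name), exits caused by a signal."""
    counts = Counter()
    for line in lines:
        m = FIELDS_RE.search(line)
        if m and m.group(1) != "None":
            counts[(m.group(3), signal_name(int(m.group(1))))] += 1
    return counts


log_lines = [
    "Sep 27 06:09:59 2016 (7136) Master qrunner detected subprocess exit "
    "(pid: 1194, sig: 9, sts: None, class: VirginRunner, slice: 1/1) [restarting]",
    "Sep 27 06:13:22 2016 (7136) Master qrunner detected subprocess exit "
    "(pid: 1246, sig: 9, sts: None, class: IncomingRunner, slice: 1/1) [restarting]",
    "Sep 27 06:15:09 2016 (7136) Master qrunner detected subprocess exit "
    "(pid: 1439, sig: 9, sts: None, class: VirginRunner, slice: 1/1) [restarting]",
]

for (runner, sig), n in sorted(kills_per_runner(log_lines).items()):
    print(runner, sig, n)
```

Because SIGKILL cannot be caught or handled by a process, a run of such exits is consistent with something outside the runner ending it.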

Mark, I referred to <https://wiki.list.org/x/17891756> before I even contacted the list. All of the cgi wrappers are suid. check_perms run as root finds no problems. One thing I noticed is that there was no locks directory anywhere in the installation. Is this normal? (Places I looked: /var/lib/mailman /usr/lib/mailman and /etc/mailman.) Also the /var/log/mailman/lock was empty but now shows:

Sep 27 10:51:45 2016 (12700) fttc.lock lifetime has expired, breaking
Sep 27 10:51:45 2016 (12700)   File "/usr/lib/mailman/bin/qrunner", line 278, in <module>
Sep 27 10:51:45 2016 (12700)     main()
Sep 27 10:51:45 2016 (12700)   File "/usr/lib/mailman/bin/qrunner", line 238, in main
Sep 27 10:51:45 2016 (12700)     qrunner.run()
Sep 27 10:51:45 2016 (12700)   File "/usr/lib/mailman/Mailman/Queue/Runner.py", line 70, in run
Sep 27 10:51:45 2016 (12700)     filecnt = self._oneloop()
Sep 27 10:51:45 2016 (12700)   File "/usr/lib/mailman/Mailman/Queue/Runner.py", line 119, in _oneloop
Sep 27 10:51:45 2016 (12700)     self._onefile(msg, msgdata)
Sep 27 10:51:45 2016 (12700)   File "/usr/lib/mailman/Mailman/Queue/Runner.py", line 190, in _onefile
Sep 27 10:51:45 2016 (12700)     keepqueued = self._dispose(mlist, msg, msgdata)
Sep 27 10:51:45 2016 (12700)   File "/usr/lib/mailman/Mailman/Queue/IncomingRunner.py", line 115, in _dispose
Sep 27 10:51:45 2016 (12700)     mlist.Lock(timeout=mm_cfg.LIST_LOCK_TIMEOUT)
Sep 27 10:51:45 2016 (12700)   File "/usr/lib/mailman/Mailman/MailList.py", line 161, in Lock
Sep 27 10:51:45 2016 (12700)     self.__lock.lock(timeout)
Sep 27 10:51:45 2016 (12700)   File "/usr/lib/mailman/Mailman/LockFile.py", line 306, in lock
Sep 27 10:51:45 2016 (12700)     important=True)
Sep 27 10:51:45 2016 (12700)   File "/usr/lib/mailman/Mailman/LockFile.py", line 416, in __writelog
Sep 27 10:51:45 2016 (12700)     traceback.print_stack(file=logf)

(Which is referencing the list in question.)
Thanks again,
Chuck

On 09/27/2016 07:54 AM, Chuck Weinstock wrote:
Your mailman locks directory is /var/lock/mailman/. See <https://wiki.list.org/x/8486953>.
I suspect the problem with the CGIs has to do with the qrunners being KILLed and leaving the list locked.
That still doesn't explain why the qrunners are being SIGKILLed.
Is there anything in /var/spool/mailman/shunt, /var/spool/mailman/retry or /var/spool/mailman/in?
If so, what does Mailman's 'bin/dumpdb -p' produce on those files? I'm looking for some kind of message corruption and also the metadata.
-- Mark Sapiro <mark@msapiro.net> The highway is for gamblers, San Francisco Bay Area, California better use your sense - B. Dylan
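
Before running dumpdb on individual files, it can help to see how many entries each of those queue directories holds. This is a rough Python sketch, not a Mailman tool: the directory names come from the message above, the `.pck` suffix from the file listing later in this thread, and the function name `count_queue_entries` is made up for illustration.

```python
import os

# Queue locations as named in this thread; other installs may differ.
QUEUE_ROOT = "/var/spool/mailman"
QUEUES = ("shunt", "retry", "in")


def count_queue_entries(root=QUEUE_ROOT, queues=QUEUES):
    """Return {queue name: number of .pck files}, or None if unreadable."""
    counts = {}
    for name in queues:
        path = os.path.join(root, name)
        try:
            entries = os.listdir(path)
        except OSError:
            # Directory is missing or this user may not read it.
            counts[name] = None
            continue
        counts[name] = sum(1 for entry in entries if entry.endswith(".pck"))
    return counts


if __name__ == "__main__":
    for queue, n in count_queue_entries().items():
        print(queue, "unreadable" if n is None else n)
```

It likely needs to run as root or as the mailman user, since the queue files shown in this thread are mode -rw-rw---- and owned by mailman.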

Ok, there were a whole bunch of lock files that I cleared and now I can access the admin page.
The shunt directory contains 591 different lines of the form:
-rw-rw---- 1 mailman mailman 1196 Sep 26 20:25 1474919192.85809+72fc7bd3e80b280fe5def1b842936d832e60126f.pck
All from around the same time.
The other two directories are empty.
Applying dumpdb to a few of them I see that they are each spam. (No indication of corruption that I can see.)
Here’s part of an example:
[----- start pickle file -----]
<----- start object 1 ----->
From Cora4@only-4u.com Thu Sep 22 08:58:08 2016
Return-Path: <Cora4@only-4u.com>
X-Original-To: wg10.4@dependability.org
Delivered-To: wg10.4@dependability.org
Received: from [103.206.131.149] (unknown [103.206.131.149]) by personal2.localdomain (Postfix) with ESMTP id ED0984762 for <wg10.4@dependability.org>; Thu, 22 Sep 2016 08:58:07 -0400 (EDT)
Content-type: multipart/mixed; boundary=Apple-Mail-D4D73760-E06639FC-B9B1373-86986B42-2365F1462A7E2E3657B
Content-transfer-encoding: 7bit
From: "Cora wilkes" <Cora4@only-4u.com>
MIME-version: 1.0 (1.0)
Date: Thu, 22 Sep 2016 18:28:05 +0530
Subject: Invoice INV00001226
Message-id: <2BD4D6-22C09DB0-F12148EE-B60C5033-85FB1AE4E06B6427C130@only-4u.com>
To: wg10.4@dependability.org
X-Mailer: iPhone Mail (13G35)
Envelope-To: <wg10.4@dependability.org>

--Apple-Mail-D4D73760-E06639FC-B9B1373-86986B42-2365F1462A7E2E3657B
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit

Please find our invoice attached.

--Apple-Mail-D4D73760-E06639FC-B9B1373-86986B42-2365F1462A7E2E3657B
Content-Type: application/zip; name=Invoice_INV00001226.zip; x-apple-part-url=D654925-14D6BA-BC29666A-033E-AB15C1B5891F982D3F0C
Content-Disposition: attachment; filename=Invoice_INV00001226.zip
Content-Transfer-Encoding: base64

[base64 content]

--Apple-Mail-D4D73760-E06639FC-B9B1373-86986B42-2365F1462A7E2E3657B--
<----- start object 2 ----->
{ '_parsemsg': False,
  'listname': 'wg10.4',
  'received_time': 1474549089.019715,
  'tolist': 1,
  'version': 3}
[----- end pickle file -----]

On 09/27/2016 08:20 AM, Chuck Weinstock wrote:
> Ok, there were a whole bunch of lock files that I cleared and now I can access the admin page.
OK.
Maybe you were hit with a massive spam attack and that triggered something that caused the qrunners to die, but there normally would also be error log messages. Did you check for a rotated log?
If these are older shunt entries (before you reinstalled), possibly there was a permissions issue at that time that prevented writing the error log and maybe caused other issues. The one you posted had a received time from last Thursday, 22 Sept.
The spam message entries in the shunt queue can just be deleted.
-- Mark Sapiro <mark@msapiro.net> The highway is for gamblers, San Francisco Bay Area, California better use your sense - B. Dylan
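
The date Mark reads off the posted entry can be reproduced from the 'received_time' value in the dumpdb metadata, on the assumption that it is a Unix timestamp (seconds since 1970-01-01 UTC). A minimal Python sketch; the helper name `received_time_utc` is hypothetical.

```python
from datetime import datetime, timezone


def received_time_utc(received_time):
    """Convert a Unix timestamp (seconds since the epoch) to an aware UTC datetime."""
    return datetime.fromtimestamp(received_time, tz=timezone.utc)


# Value copied from the metadata of the shunted message posted above.
stamp = received_time_utc(1474549089.019715)
print(stamp.strftime("%Y-%m-%d %H:%M:%S UTC"))  # 2016-09-22 12:58:09 UTC
```

12:58:09 UTC is 08:58:09 at the -0400 offset shown in the message's Received header, which agrees with the Thursday, 22 Sept arrival time there and predates the reinstall.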

participants (2)
- Chuck Weinstock
- Mark Sapiro