On Tue, 2007-02-20 at 15:37 -0800, Mark Sapiro wrote:
Xueshan Feng wrote:
We are running mailman 2.1.9. Recently we had a problem that mailman's Incoming qrunner died hard, which caused messages accumulated under qfiles/in directory. Restarting mailman didn't help. We finally identified a message on top of the queue, moved it aside, restarted mailman again, then mail will start flow.
Putting that message back to the in queue will trigger the problem. The sender actually tried to send it again, and it caused the same problem.
Before I got into the details of mailman logs, I'd like to ask:
- What's the best way to detect "stalled" condition like this? We do monitor qrunner processes but in this case it didn't help because the qrunners were still running. Now I put in a script to monitor the number of files in "in" queue, and if it reaches a threshold, sends alert.
Actually, If I understand the situation correctly, IncomingRunner will not be running. It will die and be restarted, but after the 10th restart/die, it won't restart again (but this seems incorrect - see below).
The qrunner log shows they died, but if I do ps -ef |grep IncomingRunner, they are still there.
- Why this message didn't get moved to shunt directory?
I don't know why it didn't shunt. From the traceback, the message should have shunted and that should have been the end of it. Also, the problem with decode_rfc2231 in email.Utils that seems to be the underlying issue here was fixed in email 2.5.8 which shipped with Mailman 2.1.9 and should be in Mailman's pythonlib directory. Your traceback says you are getting the email library from your Python 2.3 installation. This is not correct and should not happen if Mailman is properly installed.
You nailed it!
We indeed re-packaged Mailman with a lot of Stanford's own patches. When we upgrade from 2.1.8 to 2.1.9, I missed Mailman's own pythonlib/email installation in Debian rule file.
I just re-packaged it and tested the new package. The message that caused the problem now is accepted without a problem!
- Can the mailman recover itself without human intervention?
I don't understand why the message didn't shunt. I also don't understand why IncomingRunner was restarted more than 10 times. Have you changed
MAX_RESTARTS = 10
in bin/mailmanctl?
No that's not changed. It is still 10.
Was there a second traceback with only one date/time header following the one you report? If so, what was it?
Yes, there were more similar traceback in the logs. There are 10 actually.
Have you changed Mailman/Queue/Runner.py in any way?
Not that program.
Details are followed:
More comments below:
The following error message will repeat a couple of times in error log when this message was processed:
Feb 20 07:55:53 2007 qrunner(25722): Traceback (most recent call last): Feb 20 07:55:53 2007 qrunner(25722): File "/var/lib/mailman/bin/qrunner", line 278, in ? Feb 20 07:55:53 2007 qrunner(25722): main() Feb 20 07:55:53 2007 qrunner(25722): File "/var/lib/mailman/bin/qrunner", line 238, in main Feb 20 07:55:53 2007 qrunner(25722): qrunner.run() Feb 20 07:55:53 2007 qrunner(25722): File "/usr/lib/mailman/Mailman/Queue/Runner.py", line 71, in run Feb 20 07:55:53 2007 qrunner(25722): filecnt = self._oneloop() Feb 20 07:55:53 2007 qrunner(25722): File "/usr/lib/mailman/Mailman/Queue/Runner.py", line 100, in _oneloop Feb 20 07:55:53 2007 qrunner(25722): msg, msgdata = self._switchboard.dequeue(filebase) Feb 20 07:55:53 2007 qrunner(25722): File "/usr/lib/mailman/Mailman/Queue/Switchboard.py", line 164, in dequeue Feb 20 07:55:53 2007 qrunner(25722): msg = email.message_from_string(msg, Message.Message) Feb 20 07:55:53 2007 qrunner(25722): File "/usr/lib/python2.3/email/__init__.py", line 52, in message_from_string
This and subsequent email modules should come from /usr/lib/mailman/pythonlib/email, not from /usr/lib/python2.3/email.
Feb 20 07:55:53 2007 qrunner(25722): return Parser(_class, strict=strict).parsestr(s) Feb 20 07:55:53 2007 qrunner(25722): File "/usr/lib/python2.3/email/Parser.py", line 75, in parsestr Feb 20 07:55:53 2007 qrunner(25722): return self.parse(StringIO(text), headersonly=headersonly) Feb 20 07:55:53 2007 qrunner(25722): File "/usr/lib/python2.3/email/Parser.py", line 64, in parse Feb 20 07:55:53 2007 qrunner(25722): self._parsebody(root, fp, firstbodyline) Feb 20 07:55:53 2007 qrunner(25722): File "/usr/lib/python2.3/email/Parser.py", line 240, in _parsebody Feb 20 07:55:53 2007 qrunner(25722): msgobj = self.parsestr(part) Feb 20 07:55:53 2007 qrunner(25722): File "/usr/lib/python2.3/email/Parser.py", line 75, in parsestr Feb 20 07:55:53 2007 qrunner(25722): return self.parse(StringIO(text), headersonly=headersonly) Feb 20 07:55:53 2007 qrunner(25722): File "/usr/lib/python2.3/email/Parser.py", line 64, in parse Feb 20 07:55:53 2007 qrunner(25722): self._parsebody(root, fp, firstbodyline) Feb 20 07:55:53 2007 qrunner(25722): File "/usr/lib/python2.3/email/Parser.py", line 146, in _parsebody Feb 20 07:55:53 2007 qrunner(25722): boundary = container.get_boundary() Feb 20 07:55:53 2007 qrunner(25722): File "/usr/lib/python2.3/email/Message.py", line 743, in get_boundary Feb 20 07:55:53 2007 qrunner(25722): boundary = self.get_param('boundary', missing) Feb 20 07:55:53 2007 qrunner(25722): File "/usr/lib/python2.3/email/Message.py", line 608, in get_param Feb 20 07:55:53 2007 qrunner(25722): for k, v in self._get_params_preserve(failobj, header): Feb 20 07:55:53 2007 qrunner(25722): File "/usr/lib/python2.3/email/Message.py", line 555, in _get_params_preserve Feb 20 07:55:53 2007 qrunner(25722): params = Utils.decode_params(params) Feb 20 07:55:53 2007 qrunner(25722): File "/usr/lib/python2.3/email/Utils.py", line 337, in decode_params Feb 20 07:55:53 2007 qrunner(25722): charset, language, value = decode_rfc2231(EMPTYSTRING.join(value)) Feb 20 07:55:53 2007 qrunner(25722): File "/usr/lib/python2.3/email/Utils.py", line 284, in decode_rfc2231 Feb 20 07:55:53 2007 qrunner(25722): charset, language, s = parts Feb 20 07:55:53 2007 qrunner(25722): ValueError : unpack list of wrong size
Is there a second, different traceback immediately following this?
The qrunner log shows: Feb 20 07:55:52 2007 (27933) Master qrunner detected subprocess exit (pid: 27941, sig: None, sts: 1, class: IncomingRunner, slice: 1/4) [restarting] Feb 20 07:55:53 2007 (25722) IncomingRunner qrunner started. Feb 20 07:55:53 2007 (27933) Master qrunner detected subprocess exit (pid: 25722, sig: None, sts: 1, class: IncomingRunner, slice: 1/4) [restarting] Feb 20 07:55:53 2007 (25723) IncomingRunner qrunner started. Feb 20 07:55:53 2007 (27933) Master qrunner detected subprocess exit (pid: 25723, sig: None, sts: 1, class: IncomingRunner, slice: 1/4) [restarting] Feb 20 07:55:53 2007 (25724) IncomingRunner qrunner started. Feb 20 07:55:53 2007 (27933) Master qrunner detected subprocess exit (pid: 25724, sig: None, sts: 1, class: IncomingRunner, slice: 1/4) [restarting] Feb 20 07:55:53 2007 (25725) IncomingRunner qrunner started. Feb 20 07:55:53 2007 (27933) Master qrunner detected subprocess exit (pid: 25725, sig: None, sts: 1, class: IncomingRunner, slice: 1/4) [restarting] Feb 20 07:55:53 2007 (25726) IncomingRunner qrunner started. Feb 20 07:55:54 2007 (27933) Master qrunner detected subprocess exit (pid: 25726, sig: None, sts: 1, class: IncomingRunner, slice: 1/4) [restarting] Feb 20 07:55:54 2007 (25727) IncomingRunner qrunner started. Feb 20 07:55:54 2007 (27933) Master qrunner detected subprocess exit (pid: 25727, sig: None, sts: 1, class: IncomingRunner, slice: 1/4) [restarting] Feb 20 07:55:54 2007 (25728) IncomingRunner qrunner started. Feb 20 07:55:54 2007 (27933) Master qrunner detected subprocess exit (pid: 25728, sig: None, sts: 1, class: IncomingRunner, slice: 1/4) [restarting] Feb 20 07:55:54 2007 (25729) IncomingRunner qrunner started. Feb 20 07:55:54 2007 (27933) Master qrunner detected subprocess exit (pid: 25729, sig: None, sts: 1, class: IncomingRunner, slice: 1/4) [restarting] Feb 20 07:55:54 2007 (25731) IncomingRunner qrunner started. Feb 20 07:55:55 2007 (27933) Master qrunner detected subprocess exit (pid: 25731, sig: None, sts: 1, class: IncomingRunner, slice: 1/4) [restarting] Feb 20 07:55:55 2007 (25734) IncomingRunner qrunner started. Feb 20 07:55:55 2007 (27933) Master qrunner detected subprocess exit (pid: 25734, sig: None, sts: 1, class: IncomingRunner, slice: 1/4) [restarting]
Why more than 10 restarts?
Hmm, that's 11. I don't know. The following line followed the 11th restarting:
Feb 20 07:55:55 2007 (27933) Qrunner IncomingRunner reached maximum restart limit of 10, not restarting.
There is a post at <http://mail.python.org/pipermail/mailman-developers/2007-February/019426.htm...> about a somewhat different problem with a similar result. There is a patch attached to that post which may help this situation, but the real issues here are two: Why isn't your Mailman using pythonlib, and Why did IncomingRunner die and not shunt the message?
Is it possible because of it was using python2.3's email lib instead of Mailman's own?
Thanks a lot! The problem is fixed after the Mailman's own pythonlib/email is installed.
Xueshan
--
Xueshan Feng (aka. Susan Feng)
Shared Services, ITSS Stanford University, CA 94305-3090
255 Panama St. Room 157, Polya Hall Stanford University Stanford, CA 94305-4136