Messages got stuck in "in" queue due to one "bad" message
Hello,
We are running mailman 2.1.9. Recently we had a problem that mailman's Incoming qrunner died hard, which caused messages accumulated under qfiles/in directory. Restarting mailman didn't help. We finally identified a message on top of the queue, moved it aside, restarted mailman again, then mail will start flow.
Putting that message back to the in queue will trigger the problem. The sender actually tried to send it again, and it caused the same problem.
Before I got into the details of mailman logs, I'd like to ask:
What's the best way to detect "stalled" condition like this? We do monitor qrunner processes but in this case it didn't help because the qrunners were still running. Now I put in a script to monitor the number of files in "in" queue, and if it reaches a threshold, sends alert.
Why this message didn't get moved to shunt directory?
Can the mailman recover itself without human intervention?
Details are followed:
The following error message will repeat a couple of times in error log when this message was processed:
Feb 20 07:55:53 2007 qrunner(25722): Traceback (most recent call last): Feb 20 07:55:53 2007 qrunner(25722): File "/var/lib/mailman/bin/qrunner", line 278, in ? Feb 20 07:55:53 2007 qrunner(25722): main() Feb 20 07:55:53 2007 qrunner(25722): File "/var/lib/mailman/bin/qrunner", line 238, in main Feb 20 07:55:53 2007 qrunner(25722): qrunner.run() Feb 20 07:55:53 2007 qrunner(25722): File "/usr/lib/mailman/Mailman/Queue/Runner.py", line 71, in run Feb 20 07:55:53 2007 qrunner(25722): filecnt = self._oneloop() Feb 20 07:55:53 2007 qrunner(25722): File "/usr/lib/mailman/Mailman/Queue/Runner.py", line 100, in _oneloop Feb 20 07:55:53 2007 qrunner(25722): msg, msgdata = self._switchboard.dequeue(filebase) Feb 20 07:55:53 2007 qrunner(25722): File "/usr/lib/mailman/Mailman/Queue/Switchboard.py", line 164, in dequeue Feb 20 07:55:53 2007 qrunner(25722): msg = email.message_from_string(msg, Message.Message) Feb 20 07:55:53 2007 qrunner(25722): File "/usr/lib/python2.3/email/__init__.py", line 52, in message_from_string Feb 20 07:55:53 2007 qrunner(25722): return Parser(_class, strict=strict).parsestr(s) Feb 20 07:55:53 2007 qrunner(25722): File "/usr/lib/python2.3/email/Parser.py", line 75, in parsestr Feb 20 07:55:53 2007 qrunner(25722): return self.parse(StringIO(text), headersonly=headersonly) Feb 20 07:55:53 2007 qrunner(25722): File "/usr/lib/python2.3/email/Parser.py", line 64, in parse Feb 20 07:55:53 2007 qrunner(25722): self._parsebody(root, fp, firstbodyline) Feb 20 07:55:53 2007 qrunner(25722): File "/usr/lib/python2.3/email/Parser.py", line 240, in _parsebody Feb 20 07:55:53 2007 qrunner(25722): msgobj = self.parsestr(part) Feb 20 07:55:53 2007 qrunner(25722): File "/usr/lib/python2.3/email/Parser.py", line 75, in parsestr Feb 20 07:55:53 2007 qrunner(25722): return self.parse(StringIO(text), headersonly=headersonly) Feb 20 07:55:53 2007 qrunner(25722): File "/usr/lib/python2.3/email/Parser.py", line 64, in parse Feb 20 07:55:53 2007 qrunner(25722): self._parsebody(root, fp, firstbodyline) Feb 20 07:55:53 2007 qrunner(25722): File "/usr/lib/python2.3/email/Parser.py", line 146, in _parsebody Feb 20 07:55:53 2007 qrunner(25722): boundary = container.get_boundary() Feb 20 07:55:53 2007 qrunner(25722): File "/usr/lib/python2.3/email/Message.py", line 743, in get_boundary Feb 20 07:55:53 2007 qrunner(25722): boundary = self.get_param('boundary', missing) Feb 20 07:55:53 2007 qrunner(25722): File "/usr/lib/python2.3/email/Message.py", line 608, in get_param Feb 20 07:55:53 2007 qrunner(25722): for k, v in self._get_params_preserve(failobj, header): Feb 20 07:55:53 2007 qrunner(25722): File "/usr/lib/python2.3/email/Message.py", line 555, in _get_params_preserve Feb 20 07:55:53 2007 qrunner(25722): params = Utils.decode_params(params) Feb 20 07:55:53 2007 qrunner(25722): File "/usr/lib/python2.3/email/Utils.py", line 337, in decode_params Feb 20 07:55:53 2007 qrunner(25722): charset, language, value = decode_rfc2231(EMPTYSTRING.join(value)) Feb 20 07:55:53 2007 qrunner(25722): File "/usr/lib/python2.3/email/Utils.py", line 284, in decode_rfc2231 Feb 20 07:55:53 2007 qrunner(25722): charset, language, s = parts Feb 20 07:55:53 2007 qrunner(25722): ValueError : unpack list of wrong size
The qrunner log shows: Feb 20 07:55:52 2007 (27933) Master qrunner detected subprocess exit (pid: 27941, sig: None, sts: 1, class: IncomingRunner, slice: 1/4) [restarting] Feb 20 07:55:53 2007 (25722) IncomingRunner qrunner started. Feb 20 07:55:53 2007 (27933) Master qrunner detected subprocess exit (pid: 25722, sig: None, sts: 1, class: IncomingRunner, slice: 1/4) [restarting] Feb 20 07:55:53 2007 (25723) IncomingRunner qrunner started. Feb 20 07:55:53 2007 (27933) Master qrunner detected subprocess exit (pid: 25723, sig: None, sts: 1, class: IncomingRunner, slice: 1/4) [restarting] Feb 20 07:55:53 2007 (25724) IncomingRunner qrunner started. Feb 20 07:55:53 2007 (27933) Master qrunner detected subprocess exit (pid: 25724, sig: None, sts: 1, class: IncomingRunner, slice: 1/4) [restarting] Feb 20 07:55:53 2007 (25725) IncomingRunner qrunner started. Feb 20 07:55:53 2007 (27933) Master qrunner detected subprocess exit (pid: 25725, sig: None, sts: 1, class: IncomingRunner, slice: 1/4) [restarting] Feb 20 07:55:53 2007 (25726) IncomingRunner qrunner started. Feb 20 07:55:54 2007 (27933) Master qrunner detected subprocess exit (pid: 25726, sig: None, sts: 1, class: IncomingRunner, slice: 1/4) [restarting] Feb 20 07:55:54 2007 (25727) IncomingRunner qrunner started. Feb 20 07:55:54 2007 (27933) Master qrunner detected subprocess exit (pid: 25727, sig: None, sts: 1, class: IncomingRunner, slice: 1/4) [restarting] Feb 20 07:55:54 2007 (25728) IncomingRunner qrunner started. Feb 20 07:55:54 2007 (27933) Master qrunner detected subprocess exit (pid: 25728, sig: None, sts: 1, class: IncomingRunner, slice: 1/4) [restarting] Feb 20 07:55:54 2007 (25729) IncomingRunner qrunner started. Feb 20 07:55:54 2007 (27933) Master qrunner detected subprocess exit (pid: 25729, sig: None, sts: 1, class: IncomingRunner, slice: 1/4) [restarting] Feb 20 07:55:54 2007 (25731) IncomingRunner qrunner started. Feb 20 07:55:55 2007 (27933) Master qrunner detected subprocess exit (pid: 25731, sig: None, sts: 1, class: IncomingRunner, slice: 1/4) [restarting] Feb 20 07:55:55 2007 (25734) IncomingRunner qrunner started. Feb 20 07:55:55 2007 (27933) Master qrunner detected subprocess exit (pid: 25734, sig: None, sts: 1, class: IncomingRunner, slice: 1/4) [restarting]
Xueshan
Xueshan Feng wrote:
We are running mailman 2.1.9. Recently we had a problem that mailman's Incoming qrunner died hard, which caused messages accumulated under qfiles/in directory. Restarting mailman didn't help. We finally identified a message on top of the queue, moved it aside, restarted mailman again, then mail will start flow.
Putting that message back to the in queue will trigger the problem. The sender actually tried to send it again, and it caused the same problem.
Before I got into the details of mailman logs, I'd like to ask:
- What's the best way to detect "stalled" condition like this? We do monitor qrunner processes but in this case it didn't help because the qrunners were still running. Now I put in a script to monitor the number of files in "in" queue, and if it reaches a threshold, sends alert.
Actually, If I understand the situation correctly, IncomingRunner will not be running. It will die and be restarted, but after the 10th restart/die, it won't restart again (but this seems incorrect - see below).
- Why this message didn't get moved to shunt directory?
I don't know why it didn't shunt. From the traceback, the message should have shunted and that should have been the end of it. Also, the problem with decode_rfc2231 in email.Utils that seems to be the underlying issue here was fixed in email 2.5.8 which shipped with Mailman 2.1.9 and should be in Mailman's pythonlib directory. Your traceback says you are getting the email library from your Python 2.3 installation. This is not correct and should not happen if Mailman is properly installed.
If I understand the actual problem with the message, it is caused by an error in email 2.5.7 and earlier and is precipitated by a message with an apostrophe (') in the subject.
- Can the mailman recover itself without human intervention?
I don't understand why the message didn't shunt. I also don't understand why IncomingRunner was restarted more than 10 times. Have you changed
MAX_RESTARTS = 10
in bin/mailmanctl?
Was there a second traceback with only one date/time header following the one you report? If so, what was it?
Have you changed Mailman/Queue/Runner.py in any way?
Details are followed:
More comments below:
The following error message will repeat a couple of times in error log when this message was processed:
Feb 20 07:55:53 2007 qrunner(25722): Traceback (most recent call last): Feb 20 07:55:53 2007 qrunner(25722): File "/var/lib/mailman/bin/qrunner", line 278, in ? Feb 20 07:55:53 2007 qrunner(25722): main() Feb 20 07:55:53 2007 qrunner(25722): File "/var/lib/mailman/bin/qrunner", line 238, in main Feb 20 07:55:53 2007 qrunner(25722): qrunner.run() Feb 20 07:55:53 2007 qrunner(25722): File "/usr/lib/mailman/Mailman/Queue/Runner.py", line 71, in run Feb 20 07:55:53 2007 qrunner(25722): filecnt = self._oneloop() Feb 20 07:55:53 2007 qrunner(25722): File "/usr/lib/mailman/Mailman/Queue/Runner.py", line 100, in _oneloop Feb 20 07:55:53 2007 qrunner(25722): msg, msgdata = self._switchboard.dequeue(filebase) Feb 20 07:55:53 2007 qrunner(25722): File "/usr/lib/mailman/Mailman/Queue/Switchboard.py", line 164, in dequeue Feb 20 07:55:53 2007 qrunner(25722): msg = email.message_from_string(msg, Message.Message) Feb 20 07:55:53 2007 qrunner(25722): File "/usr/lib/python2.3/email/__init__.py", line 52, in message_from_string
This and subsequent email modules should come from /usr/lib/mailman/pythonlib/email, not from /usr/lib/python2.3/email.
Feb 20 07:55:53 2007 qrunner(25722): return Parser(_class, strict=strict).parsestr(s) Feb 20 07:55:53 2007 qrunner(25722): File "/usr/lib/python2.3/email/Parser.py", line 75, in parsestr Feb 20 07:55:53 2007 qrunner(25722): return self.parse(StringIO(text), headersonly=headersonly) Feb 20 07:55:53 2007 qrunner(25722): File "/usr/lib/python2.3/email/Parser.py", line 64, in parse Feb 20 07:55:53 2007 qrunner(25722): self._parsebody(root, fp, firstbodyline) Feb 20 07:55:53 2007 qrunner(25722): File "/usr/lib/python2.3/email/Parser.py", line 240, in _parsebody Feb 20 07:55:53 2007 qrunner(25722): msgobj = self.parsestr(part) Feb 20 07:55:53 2007 qrunner(25722): File "/usr/lib/python2.3/email/Parser.py", line 75, in parsestr Feb 20 07:55:53 2007 qrunner(25722): return self.parse(StringIO(text), headersonly=headersonly) Feb 20 07:55:53 2007 qrunner(25722): File "/usr/lib/python2.3/email/Parser.py", line 64, in parse Feb 20 07:55:53 2007 qrunner(25722): self._parsebody(root, fp, firstbodyline) Feb 20 07:55:53 2007 qrunner(25722): File "/usr/lib/python2.3/email/Parser.py", line 146, in _parsebody Feb 20 07:55:53 2007 qrunner(25722): boundary = container.get_boundary() Feb 20 07:55:53 2007 qrunner(25722): File "/usr/lib/python2.3/email/Message.py", line 743, in get_boundary Feb 20 07:55:53 2007 qrunner(25722): boundary = self.get_param('boundary', missing) Feb 20 07:55:53 2007 qrunner(25722): File "/usr/lib/python2.3/email/Message.py", line 608, in get_param Feb 20 07:55:53 2007 qrunner(25722): for k, v in self._get_params_preserve(failobj, header): Feb 20 07:55:53 2007 qrunner(25722): File "/usr/lib/python2.3/email/Message.py", line 555, in _get_params_preserve Feb 20 07:55:53 2007 qrunner(25722): params = Utils.decode_params(params) Feb 20 07:55:53 2007 qrunner(25722): File "/usr/lib/python2.3/email/Utils.py", line 337, in decode_params Feb 20 07:55:53 2007 qrunner(25722): charset, language, value = decode_rfc2231(EMPTYSTRING.join(value)) Feb 20 07:55:53 2007 qrunner(25722): File "/usr/lib/python2.3/email/Utils.py", line 284, in decode_rfc2231 Feb 20 07:55:53 2007 qrunner(25722): charset, language, s = parts Feb 20 07:55:53 2007 qrunner(25722): ValueError : unpack list of wrong size
Is there a second, different traceback immediately following this?
The qrunner log shows: Feb 20 07:55:52 2007 (27933) Master qrunner detected subprocess exit (pid: 27941, sig: None, sts: 1, class: IncomingRunner, slice: 1/4) [restarting] Feb 20 07:55:53 2007 (25722) IncomingRunner qrunner started. Feb 20 07:55:53 2007 (27933) Master qrunner detected subprocess exit (pid: 25722, sig: None, sts: 1, class: IncomingRunner, slice: 1/4) [restarting] Feb 20 07:55:53 2007 (25723) IncomingRunner qrunner started. Feb 20 07:55:53 2007 (27933) Master qrunner detected subprocess exit (pid: 25723, sig: None, sts: 1, class: IncomingRunner, slice: 1/4) [restarting] Feb 20 07:55:53 2007 (25724) IncomingRunner qrunner started. Feb 20 07:55:53 2007 (27933) Master qrunner detected subprocess exit (pid: 25724, sig: None, sts: 1, class: IncomingRunner, slice: 1/4) [restarting] Feb 20 07:55:53 2007 (25725) IncomingRunner qrunner started. Feb 20 07:55:53 2007 (27933) Master qrunner detected subprocess exit (pid: 25725, sig: None, sts: 1, class: IncomingRunner, slice: 1/4) [restarting] Feb 20 07:55:53 2007 (25726) IncomingRunner qrunner started. Feb 20 07:55:54 2007 (27933) Master qrunner detected subprocess exit (pid: 25726, sig: None, sts: 1, class: IncomingRunner, slice: 1/4) [restarting] Feb 20 07:55:54 2007 (25727) IncomingRunner qrunner started. Feb 20 07:55:54 2007 (27933) Master qrunner detected subprocess exit (pid: 25727, sig: None, sts: 1, class: IncomingRunner, slice: 1/4) [restarting] Feb 20 07:55:54 2007 (25728) IncomingRunner qrunner started. Feb 20 07:55:54 2007 (27933) Master qrunner detected subprocess exit (pid: 25728, sig: None, sts: 1, class: IncomingRunner, slice: 1/4) [restarting] Feb 20 07:55:54 2007 (25729) IncomingRunner qrunner started. Feb 20 07:55:54 2007 (27933) Master qrunner detected subprocess exit (pid: 25729, sig: None, sts: 1, class: IncomingRunner, slice: 1/4) [restarting] Feb 20 07:55:54 2007 (25731) IncomingRunner qrunner started. Feb 20 07:55:55 2007 (27933) Master qrunner detected subprocess exit (pid: 25731, sig: None, sts: 1, class: IncomingRunner, slice: 1/4) [restarting] Feb 20 07:55:55 2007 (25734) IncomingRunner qrunner started. Feb 20 07:55:55 2007 (27933) Master qrunner detected subprocess exit (pid: 25734, sig: None, sts: 1, class: IncomingRunner, slice: 1/4) [restarting]
Why more than 10 restarts?
There is a post at <http://mail.python.org/pipermail/mailman-developers/2007-February/019426.htm...> about a somewhat different problem with a similar result. There is a patch attached to that post which may help this situation, but the real issues here are two: Why isn't your Mailman using pythonlib, and Why did IncomingRunner die and not shunt the message?
-- Mark Sapiro <msapiro@value.net> The highway is for gamblers, San Francisco Bay Area, California better use your sense - B. Dylan
Mark Sapiro writes:
- Why this message didn't get moved to shunt directory?
I don't know why it didn't shunt.
At least in earlier implementations of the email lib and Mailman, the original parse of the message was not enclosed in the shunt mechanism, so the exception got caught by the catchall handler, not the shunt handler.
As of 2.1.5, the I18N stuff made assumptions that certain things were ASCII, and that it was handling MIME conversions correctly. These assumptions have been regularly violated, and patches have been applied piecemeal, but AFAICS (from a 2.1.5 vs. 2.1.8 diff I did a while back) the fundamental architectural issue was not addressed. It would appear that that still hasn't been done.
It's not obvious to me that that's a change that's appropriate for 2.1.
Stephen J. Turnbull wrote:
At least in earlier implementations of the email lib and Mailman, the original parse of the message was not enclosed in the shunt mechanism, so the exception got caught by the catchall handler, not the shunt handler.
You're right Stephen. I overlooked that. The dequeue of the message is in it's own try, not in the main try that runs the process. Thus, the failure to parse the message in dequeue is supposed to result in the message being discarded and an 'Ignoring unparseable message:' entry in the 'error' log. The problem here is ValueError is not one of the exceptions we catch in the try around dequeue. We only catch email.Errors.MessageParseError.
Another problem is in implementing backup queue entries to prevent message loss in the event of catastrophic failure, we forgot to delete the backup queue entry at this juncture. The attached patch needs to be applied to Mailman/Queue/Runner.py, but that won't fix this immediate problem. For that, we need to expand the exceptions we catch at
try:
# Ask the switchboard for the message and metadata
objects # associated with this filebase. msg, msgdata = self._switchboard.dequeue(filebase) except email.Errors.MessageParseError, e:
A workaround in the interim is to change the except above to
except Exception, e:
This will require some more thought for a permanent fix.
-- Mark Sapiro <msapiro@value.net> The highway is for gamblers, San Francisco Bay Area, California better use your sense - B. Dylan
On Tue, 2007-02-20 at 15:37 -0800, Mark Sapiro wrote:
Xueshan Feng wrote:
We are running mailman 2.1.9. Recently we had a problem that mailman's Incoming qrunner died hard, which caused messages accumulated under qfiles/in directory. Restarting mailman didn't help. We finally identified a message on top of the queue, moved it aside, restarted mailman again, then mail will start flow.
Putting that message back to the in queue will trigger the problem. The sender actually tried to send it again, and it caused the same problem.
Before I got into the details of mailman logs, I'd like to ask:
- What's the best way to detect "stalled" condition like this? We do monitor qrunner processes but in this case it didn't help because the qrunners were still running. Now I put in a script to monitor the number of files in "in" queue, and if it reaches a threshold, sends alert.
Actually, If I understand the situation correctly, IncomingRunner will not be running. It will die and be restarted, but after the 10th restart/die, it won't restart again (but this seems incorrect - see below).
The qrunner log shows they died, but if I do ps -ef |grep IncomingRunner, they are still there.
- Why this message didn't get moved to shunt directory?
I don't know why it didn't shunt. From the traceback, the message should have shunted and that should have been the end of it. Also, the problem with decode_rfc2231 in email.Utils that seems to be the underlying issue here was fixed in email 2.5.8 which shipped with Mailman 2.1.9 and should be in Mailman's pythonlib directory. Your traceback says you are getting the email library from your Python 2.3 installation. This is not correct and should not happen if Mailman is properly installed.
You nailed it!
We indeed re-packaged Mailman with a lot of Stanford's own patches. When we upgrade from 2.1.8 to 2.1.9, I missed Mailman's own pythonlib/email installation in Debian rule file.
I just re-packaged it and tested the new package. The message that caused the problem now is accepted without a problem!
- Can the mailman recover itself without human intervention?
I don't understand why the message didn't shunt. I also don't understand why IncomingRunner was restarted more than 10 times. Have you changed
MAX_RESTARTS = 10
in bin/mailmanctl?
No that's not changed. It is still 10.
Was there a second traceback with only one date/time header following the one you report? If so, what was it?
Yes, there were more similar traceback in the logs. There are 10 actually.
Have you changed Mailman/Queue/Runner.py in any way?
Not that program.
Details are followed:
More comments below:
The following error message will repeat a couple of times in error log when this message was processed:
Feb 20 07:55:53 2007 qrunner(25722): Traceback (most recent call last): Feb 20 07:55:53 2007 qrunner(25722): File "/var/lib/mailman/bin/qrunner", line 278, in ? Feb 20 07:55:53 2007 qrunner(25722): main() Feb 20 07:55:53 2007 qrunner(25722): File "/var/lib/mailman/bin/qrunner", line 238, in main Feb 20 07:55:53 2007 qrunner(25722): qrunner.run() Feb 20 07:55:53 2007 qrunner(25722): File "/usr/lib/mailman/Mailman/Queue/Runner.py", line 71, in run Feb 20 07:55:53 2007 qrunner(25722): filecnt = self._oneloop() Feb 20 07:55:53 2007 qrunner(25722): File "/usr/lib/mailman/Mailman/Queue/Runner.py", line 100, in _oneloop Feb 20 07:55:53 2007 qrunner(25722): msg, msgdata = self._switchboard.dequeue(filebase) Feb 20 07:55:53 2007 qrunner(25722): File "/usr/lib/mailman/Mailman/Queue/Switchboard.py", line 164, in dequeue Feb 20 07:55:53 2007 qrunner(25722): msg = email.message_from_string(msg, Message.Message) Feb 20 07:55:53 2007 qrunner(25722): File "/usr/lib/python2.3/email/__init__.py", line 52, in message_from_string
This and subsequent email modules should come from /usr/lib/mailman/pythonlib/email, not from /usr/lib/python2.3/email.
Feb 20 07:55:53 2007 qrunner(25722): return Parser(_class, strict=strict).parsestr(s) Feb 20 07:55:53 2007 qrunner(25722): File "/usr/lib/python2.3/email/Parser.py", line 75, in parsestr Feb 20 07:55:53 2007 qrunner(25722): return self.parse(StringIO(text), headersonly=headersonly) Feb 20 07:55:53 2007 qrunner(25722): File "/usr/lib/python2.3/email/Parser.py", line 64, in parse Feb 20 07:55:53 2007 qrunner(25722): self._parsebody(root, fp, firstbodyline) Feb 20 07:55:53 2007 qrunner(25722): File "/usr/lib/python2.3/email/Parser.py", line 240, in _parsebody Feb 20 07:55:53 2007 qrunner(25722): msgobj = self.parsestr(part) Feb 20 07:55:53 2007 qrunner(25722): File "/usr/lib/python2.3/email/Parser.py", line 75, in parsestr Feb 20 07:55:53 2007 qrunner(25722): return self.parse(StringIO(text), headersonly=headersonly) Feb 20 07:55:53 2007 qrunner(25722): File "/usr/lib/python2.3/email/Parser.py", line 64, in parse Feb 20 07:55:53 2007 qrunner(25722): self._parsebody(root, fp, firstbodyline) Feb 20 07:55:53 2007 qrunner(25722): File "/usr/lib/python2.3/email/Parser.py", line 146, in _parsebody Feb 20 07:55:53 2007 qrunner(25722): boundary = container.get_boundary() Feb 20 07:55:53 2007 qrunner(25722): File "/usr/lib/python2.3/email/Message.py", line 743, in get_boundary Feb 20 07:55:53 2007 qrunner(25722): boundary = self.get_param('boundary', missing) Feb 20 07:55:53 2007 qrunner(25722): File "/usr/lib/python2.3/email/Message.py", line 608, in get_param Feb 20 07:55:53 2007 qrunner(25722): for k, v in self._get_params_preserve(failobj, header): Feb 20 07:55:53 2007 qrunner(25722): File "/usr/lib/python2.3/email/Message.py", line 555, in _get_params_preserve Feb 20 07:55:53 2007 qrunner(25722): params = Utils.decode_params(params) Feb 20 07:55:53 2007 qrunner(25722): File "/usr/lib/python2.3/email/Utils.py", line 337, in decode_params Feb 20 07:55:53 2007 qrunner(25722): charset, language, value = decode_rfc2231(EMPTYSTRING.join(value)) Feb 20 07:55:53 2007 qrunner(25722): File "/usr/lib/python2.3/email/Utils.py", line 284, in decode_rfc2231 Feb 20 07:55:53 2007 qrunner(25722): charset, language, s = parts Feb 20 07:55:53 2007 qrunner(25722): ValueError : unpack list of wrong size
Is there a second, different traceback immediately following this?
The qrunner log shows: Feb 20 07:55:52 2007 (27933) Master qrunner detected subprocess exit (pid: 27941, sig: None, sts: 1, class: IncomingRunner, slice: 1/4) [restarting] Feb 20 07:55:53 2007 (25722) IncomingRunner qrunner started. Feb 20 07:55:53 2007 (27933) Master qrunner detected subprocess exit (pid: 25722, sig: None, sts: 1, class: IncomingRunner, slice: 1/4) [restarting] Feb 20 07:55:53 2007 (25723) IncomingRunner qrunner started. Feb 20 07:55:53 2007 (27933) Master qrunner detected subprocess exit (pid: 25723, sig: None, sts: 1, class: IncomingRunner, slice: 1/4) [restarting] Feb 20 07:55:53 2007 (25724) IncomingRunner qrunner started. Feb 20 07:55:53 2007 (27933) Master qrunner detected subprocess exit (pid: 25724, sig: None, sts: 1, class: IncomingRunner, slice: 1/4) [restarting] Feb 20 07:55:53 2007 (25725) IncomingRunner qrunner started. Feb 20 07:55:53 2007 (27933) Master qrunner detected subprocess exit (pid: 25725, sig: None, sts: 1, class: IncomingRunner, slice: 1/4) [restarting] Feb 20 07:55:53 2007 (25726) IncomingRunner qrunner started. Feb 20 07:55:54 2007 (27933) Master qrunner detected subprocess exit (pid: 25726, sig: None, sts: 1, class: IncomingRunner, slice: 1/4) [restarting] Feb 20 07:55:54 2007 (25727) IncomingRunner qrunner started. Feb 20 07:55:54 2007 (27933) Master qrunner detected subprocess exit (pid: 25727, sig: None, sts: 1, class: IncomingRunner, slice: 1/4) [restarting] Feb 20 07:55:54 2007 (25728) IncomingRunner qrunner started. Feb 20 07:55:54 2007 (27933) Master qrunner detected subprocess exit (pid: 25728, sig: None, sts: 1, class: IncomingRunner, slice: 1/4) [restarting] Feb 20 07:55:54 2007 (25729) IncomingRunner qrunner started. Feb 20 07:55:54 2007 (27933) Master qrunner detected subprocess exit (pid: 25729, sig: None, sts: 1, class: IncomingRunner, slice: 1/4) [restarting] Feb 20 07:55:54 2007 (25731) IncomingRunner qrunner started. Feb 20 07:55:55 2007 (27933) Master qrunner detected subprocess exit (pid: 25731, sig: None, sts: 1, class: IncomingRunner, slice: 1/4) [restarting] Feb 20 07:55:55 2007 (25734) IncomingRunner qrunner started. Feb 20 07:55:55 2007 (27933) Master qrunner detected subprocess exit (pid: 25734, sig: None, sts: 1, class: IncomingRunner, slice: 1/4) [restarting]
Why more than 10 restarts?
Hmm, that's 11. I don't know. The following line followed the 11th restarting:
Feb 20 07:55:55 2007 (27933) Qrunner IncomingRunner reached maximum restart limit of 10, not restarting.
There is a post at <http://mail.python.org/pipermail/mailman-developers/2007-February/019426.htm...> about a somewhat different problem with a similar result. There is a patch attached to that post which may help this situation, but the real issues here are two: Why isn't your Mailman using pythonlib, and Why did IncomingRunner die and not shunt the message?
Is it possible because of it was using python2.3's email lib instead of Mailman's own?
Thanks a lot! The problem is fixed after the Mailman's own pythonlib/email is installed.
Xueshan
--
Xueshan Feng (aka. Susan Feng)
Shared Services, ITSS Stanford University, CA 94305-3090
255 Panama St. Room 157, Polya Hall Stanford University Stanford, CA 94305-4136
Xueshan Feng wrote:
On Tue, 2007-02-20 at 15:37 -0800, Mark Sapiro wrote:
Xueshan Feng wrote:
Actually, If I understand the situation correctly, IncomingRunner will not be running. It will die and be restarted, but after the 10th restart/die, it won't restart again (but this seems incorrect - see below).
The qrunner log shows they died, but if I do ps -ef |grep IncomingRunner, they are still there.
All four of them?
- Can the mailman recover itself without human intervention?
I don't understand why the message didn't shunt.
I understand why the message didn't shunt. See Stephen's post and my reply in this thread.
Why more than 10 restarts?
Hmm, that's 11. I don't know. The following line followed the 11th restarting:
Feb 20 07:55:55 2007 (27933) Qrunner IncomingRunner reached maximum restart limit of 10, not restarting.
Maybe it counts from zero and gets 11. It looks OK. It reached the limit and didn't restart, but there should only have been the 3 remaining IncomingRunners, not four when you looked.
Thanks a lot! The problem is fixed after the Mailman's own pythonlib/email is installed.
There are still some remaining Runner issues. These will all be cleaned up in Mailman 2.1.10. In the mean time, I may release a patch after I fully work out what to do.
The bottom line is this error should not have caused the IncomingRunner to die, and it should not have been looping on the bad message. These will be fixed.
-- Mark Sapiro <msapiro@value.net> The highway is for gamblers, San Francisco Bay Area, California better use your sense - B. Dylan
Quoting Mark Sapiro <msapiro@value.net>:
Xueshan Feng wrote:
On Tue, 2007-02-20 at 15:37 -0800, Mark Sapiro wrote:
Xueshan Feng wrote:
Actually, If I understand the situation correctly, IncomingRunner will not be running. It will die and be restarted, but after the 10th restart/die, it won't restart again (but this seems incorrect - see below).
The qrunner log shows they died, but if I do ps -ef |grep IncomingRunner, they are still there.
All four of them?
Only 3:
mailman:/var/lib/mailman/qfiles# ps -ef | grep Inco list 21394 21383 0 18:45 ? 00:00:17 /usr/bin/python /var/lib/mailman/bin/qrunner --runner=IncomingRunner:0:4 -s list 21395 21383 0 18:45 ? 00:00:13 /usr/bin/python /var/lib/mailman/bin/qrunner --runner=IncomingRunner:1:4 -s list 21396 21383 0 18:45 ? 00:00:15 /usr/bin/python /var/lib/mailman/bin/qrunner --runner=IncomingRunner:2:4 -s
A these 3 remaining ones would not do the work?
Xueshan
Xueshan Feng wrote:
Quoting Mark Sapiro <msapiro@value.net>:
All four of them?
Only 3:
mailman:/var/lib/mailman/qfiles# ps -ef | grep Inco list 21394 21383 0 18:45 ? 00:00:17 /usr/bin/python /var/lib/mailman/bin/qrunner --runner=IncomingRunner:0:4 -s list 21395 21383 0 18:45 ? 00:00:13 /usr/bin/python /var/lib/mailman/bin/qrunner --runner=IncomingRunner:1:4 -s list 21396 21383 0 18:45 ? 00:00:15 /usr/bin/python /var/lib/mailman/bin/qrunner --runner=IncomingRunner:2:4 -s
A these 3 remaining ones would not do the work?
You are processing the 'in' queue in 4 slices with 4 IncomingRunner instances.
You do this by redefining or modifying the QRUNNERS list in mm_cfg.py. The entry in the list for IncomingRunner is
('IncomingRunner', 4),
meaning that there will be four IncomingRunner instances processing slices 0:4, 1:4, 2:4 and 3:4 (the 3:4 runner is missing above).
Slicing works by dividing the queue entry hash space into slices (2^n in general, 4 in this case. Each slice has its own runner. A portion of each queue entry's name (the part between '+' and '.pck') is a 20 byte hash of a pickle of the message + the list name + the time of enqueueing expressed as 40 hex digits. In this case, the first quarter of the hash space is processed by the 0:4 runner, the second quarter by the 1:4 runner, etc. If one of the four runners isn't running, its portion of the hash space will not be processed.
-- Mark Sapiro <msapiro@value.net> The highway is for gamblers, San Francisco Bay Area, California better use your sense - B. Dylan
Quoting Mark Sapiro <msapiro@value.net>:
Xueshan Feng wrote:
Quoting Mark Sapiro <msapiro@value.net>:
All four of them?
Only 3:
mailman:/var/lib/mailman/qfiles# ps -ef | grep Inco list 21394 21383 0 18:45 ? 00:00:17 /usr/bin/python /var/lib/mailman/bin/qrunner --runner=IncomingRunner:0:4 -s list 21395 21383 0 18:45 ? 00:00:13 /usr/bin/python /var/lib/mailman/bin/qrunner --runner=IncomingRunner:1:4 -s list 21396 21383 0 18:45 ? 00:00:15 /usr/bin/python /var/lib/mailman/bin/qrunner --runner=IncomingRunner:2:4 -s
A these 3 remaining ones would not do the work?
You are processing the 'in' queue in 4 slices with 4 IncomingRunner instances.
You do this by redefining or modifying the QRUNNERS list in mm_cfg.py. The entry in the list for IncomingRunner is
('IncomingRunner', 4),
That's exactly I have in our mm_cfg.py.
meaning that there will be four IncomingRunner instances processing slices 0:4, 1:4, 2:4 and 3:4 (the 3:4 runner is missing above).
Slicing works by dividing the queue entry hash space into slices (2^n in general, 4 in this case. Each slice has its own runner. A portion of each queue entry's name (the part between '+' and '.pck') is a 20 byte hash of a pickle of the message + the list name + the time of enqueueing expressed as 40 hex digits. In this case, the first quarter of the hash space is processed by the 0:4 runner, the second quarter by the 1:4 runner, etc. If one of the four runners isn't running, its portion of the hash space will not be processed.
Wow, that's interesting. I was wondering how the long file name was generated in the queue. For effective monitoring, I should check total qrunner instances to match up what's configured in mm_cfg.
Thanks for the explanation!
Xueshan
participants (4)
-
Mark Sapiro -
Stephen J. Turnbull -
Xueshan Feng -
Xueshan Feng