Debugging lost messages?
Folks,
I've got another mailing list server installation (Mailman 2.1.5
and postfix 2.2-20040504), and I've just discovered that one of the lists has been broken for about a month and I'm having some problems figuring out how messages are being lost.
The postfix logs are clearly showing the messages being sent to
"post", but no evidence of them shows up anywhere in the Mailman logs. They're not in the post, error, or vette logs, and I can't figure out where else to be looking. I also can't figure out where I need to add some more debugging information to the log output.
I'm stumped. Can someone give me some advice here?
-- Brad Knowles, <brad@stop.mail-abuse.org>
"Those who would give up essential Liberty, to purchase a little temporary Safety, deserve neither Liberty nor Safety."
-- Benjamin Franklin (1706-1790), reply of the Pennsylvania
Assembly to the Governor, November 11, 1755
SAGE member since 1995. See <http://www.sage.org/> for more info.
Brad Knowles wrote:
I've got another mailing list server installation (Mailman 2.1.5 and postfix 2.2-20040504), and I've just discovered that one of the lists has been broken for about a month and I'm having some problems figuring out how messages are being lost.
The postfix logs are clearly showing the messages being sent to "post", but no evidence of them shows up anywhere in the Mailman logs. They're not in the post, error, or vette logs, and I can't figure out where else to be looking. I also can't figure out where I need to add some more debugging information to the log output.
I'm stumped. Can someone give me some advice here?
See FAQ article 3.14 (just kidding :-)
Is archiving on? Are there members with delivery not disabled? If these are digest members is digestable 'yes'? Likewise, if there are regular members is nondigestable 'yes'?
Since there are presumably other, working lists, much of the FAQ isn't relevant, but check for list locks in locks/
Also, if other lists are working, the wrapper and the scripts/post script are presumably working, at least assuming that the wrapper in the aliases pipe is the same one for this list as for the others that work.
You could put a 'debug' in scripts/post to be sure, but all it does is put the message in qfiles/in. Check the queue or maybe all the queues, but when you say there's no evidence of the messages in Mailman, maybe you've already checked.
Once the message gets to qfiles/in, processing continues with Mailman/Queue/IncomingRunner.py which basically directs the message through the handlers in Mailman/Handlers/ which are listed in the list's 'pipeline' attribute or the GLOBAL_PIPELINE. Does the list have a 'pipeline' attribute? If so, are certain critical delivery handlers such as 'ToDigest', 'ToArchive' and 'ToOutgoing' all there?
Is there a lists/<listname>/extend.py file? If so, what is it intended to do and does it work?
It may be none of these things, but that's about all I can think of at the moment.
If I couldn't find anything obvious in the above ideas, I'd try putting a debug logging statement conditional on mlist.internal_name() == listname in Mailman/Queue/IncomingRunner.py in the while loop in _dopipeline to see how far it gets.
-- Mark Sapiro <msapiro@value.net> The highway is for gamblers, San Francisco Bay Area, California better use your sense - B. Dylan
At 5:08 PM -0700 2005-10-04, Mark Sapiro wrote:
Is archiving on?
Yes. There's nothing in the archives for this list since the
15th of September, and so far as I can tell nothing has been sent out to any of the list members since then.
Are there members with delivery not disabled? If these
are digest members is digestable 'yes'? Likewise, if there are regular members is nondigestable 'yes'?
There are ten members of the list, none of which are in digest
mode, and none are disabled.
Since there are presumably other, working lists, much of the FAQ isn't relevant, but check for list locks in locks/
Nope, no locks. Of course, I had stopped and restarted Mailman,
and sent a test message through (which worked), before I checked to see if there were any locks.
Also, if other lists are working, the wrapper and the scripts/post script are presumably working, at least assuming that the wrapper in the aliases pipe is the same one for this list as for the others that work.
The other lists are working fine, so far as I can tell.
You could put a 'debug' in scripts/post to be sure, but all it does is put the message in qfiles/in. Check the queue or maybe all the queues, but when you say there's no evidence of the messages in Mailman, maybe you've already checked.
I had checked all the queues before stopping and restarting
Mailman, and they were all empty.
Once the message gets to qfiles/in, processing continues with Mailman/Queue/IncomingRunner.py which basically directs the message through the handlers in Mailman/Handlers/ which are listed in the list's 'pipeline' attribute or the GLOBAL_PIPELINE. Does the list have a 'pipeline' attribute? If so, are certain critical delivery handlers such as 'ToDigest', 'ToArchive' and 'ToOutgoing' all there?
I did not set up any list-specific pipeline, no.
Is there a lists/<listname>/extend.py file? If so, what is it intended to do and does it work?
No, there is no extend.py file for any of these lists. The
installation is pretty plain-jane. About the only thing we modified was some of the templates for auto responses, telling the sender that their message was being held, etc....
It may be none of these things, but that's about all I can think of at the moment.
I certainly couldn't think of anything else.
If I couldn't find anything obvious in the above ideas, I'd try putting a debug logging statement conditional on mlist.internal_name() == listname in Mailman/Queue/IncomingRunner.py in the while loop in _dopipeline to see how far it gets.
I'll give that a shot. Thanks!
-- Brad Knowles, <brad@stop.mail-abuse.org>
"Those who would give up essential Liberty, to purchase a little temporary Safety, deserve neither Liberty nor Safety."
-- Benjamin Franklin (1706-1790), reply of the Pennsylvania
Assembly to the Governor, November 11, 1755
SAGE member since 1995. See <http://www.sage.org/> for more info.
At 5:08 PM -0700 2005-10-04, Mark Sapiro wrote:
If I couldn't find anything obvious in the above ideas, I'd try putting a debug logging statement conditional on mlist.internal_name() == listname in Mailman/Queue/IncomingRunner.py in the while loop in _dopipeline to see how far it gets.
I added a bit more debugging, and the important section looks like this:
except Errors.DiscardMessage:
# Throw the message away; we need do nothing else with it.
syslog('vette', 'Message discarded, listname: %s, msgid: %s',
listname,
msg.get('message-id', 'n/a'))
return 0
except Errors.HoldMessage:
# Let the approval process take it from here. The message no
# longer needs to be queued.
return 0
except Errors.RejectMessage, e:
mlist.BounceMessage(msg, msgdata, e)
syslog('vette', 'Message bounced, listname: %s, msgid: %s',
listname,
msg.get('message-id', 'n/a'))
return 0
So, now I'm seeing the listname in conjunction with the "message
discarded" log entries, and I'm getting a completely new "message bounced" log entry which wasn't there before at all.
But I'm not seeing any details as to why a message is being
discarded or bouncing. I see two more messages that came in today to the list in question, and by matching message-ids and listnames between the Mailman "vette" log and the postfix syslog, I can tell that these two messages bounced. But not why. Maybe they were spam?
And I have yet to see any "normal" messages come into this list
so far, so I can't tell what may have been happening to them and why they weren't being posted.
-- Brad Knowles, <brad@stop.mail-abuse.org>
"Those who would give up essential Liberty, to purchase a little temporary Safety, deserve neither Liberty nor Safety."
-- Benjamin Franklin (1706-1790), reply of the Pennsylvania
Assembly to the Governor, November 11, 1755
SAGE member since 1995. See <http://www.sage.org/> for more info.
On Thu, 2005-10-06 at 12:39, Brad Knowles wrote:
But I'm not seeing any details as to why a message is being discarded or bouncing. I see two more messages that came in today to the list in question, and by matching message-ids and listnames between the Mailman "vette" log and the postfix syslog, I can tell that these two messages bounced. But not why. Maybe they were spam?
Maybe they're triggering one of your content (or other) filters? Have you got this list set up to discard non-members or something else of that nature?
-Barry
At 10:31 PM -0400 2005-10-06, Barry Warsaw wrote:
Maybe they're triggering one of your content (or other) filters? Have you got this list set up to discard non-members or something else of that nature?
I thought of those. All non-member postings are supposed to be
held for moderation, and there's nothing beyond the standard out-of-the-box "legacy" anti-spam filters in place for this list.
I've got a lot of anti-spam filters in SpamAssassin that might
also be causing these types of postings to be lost (reports from BitKeeper, which tend to run afoul of the "Chickenpox" rules), but then they shouldn't show up in the postfix log as having been delivered to the Mailman "post" process, and they certainly wouldn't trip any of the "legacy" spam filters.
This is what I find so frustrating about debugging this
particular process. As far as Mailman is concerned, you really can't get too much more plain-jane than what we're running. And yet, stuff has clearly been broken for about a month now, and I can't figure out why.
But if you've got ideas on some places where I can put in some
further debugging to see what's going on and why, I'll be glad to give that a shot.
-- Brad Knowles, <brad@stop.mail-abuse.org>
"Those who would give up essential Liberty, to purchase a little temporary Safety, deserve neither Liberty nor Safety."
-- Benjamin Franklin (1706-1790), reply of the Pennsylvania
Assembly to the Governor, November 11, 1755
SAGE member since 1995. See <http://www.sage.org/> for more info.
At 5:29 AM +0200 2005-10-07, Brad Knowles wrote:
This is what I find so frustrating about debugging this particular process. As far as Mailman is concerned, you really can't get too much more plain-jane than what we're running. And yet, stuff has clearly been broken for about a month now, and I can't figure out why.
One thing I've found particularly frustrating is that I can now
see messages being bounced which should not be (thanks to the additional debugging that I've put into place), but I can't figure out which handler is causing the messages to be bounced.
The code in IncomingRunner.py currently looks like this:
except Errors.RejectMessage, e:
mlist.BounceMessage(msg, msgdata, e)
syslog('vette', 'Message bounced, listname: %s, msgid: %s',
listname,
msg.get('message-id', 'n/a'))
return 0
What I'd like to do is add the name of the handler somewhere in
that line, but I'm not sure how to do that. I'm going to read up on programming in Python, but any advice or assistance that anyone can provide would be appreciated.
Once I can track down the offending handler, I can put in some
more debugging code into that particular routine, and try to get a better idea of why those messages are being inappropriately bounced.
-- Brad Knowles, <brad@stop.mail-abuse.org>
"Those who would give up essential Liberty, to purchase a little temporary Safety, deserve neither Liberty nor Safety."
-- Benjamin Franklin (1706-1790), reply of the Pennsylvania
Assembly to the Governor, November 11, 1755
SAGE member since 1995. See <http://www.sage.org/> for more info.
At 11:18 AM +0200 2005-10-10, Brad Knowles wrote:
What I'd like to do is add the name of the handler somewhere in that line, but I'm not sure how to do that. I'm going to read up on programming in Python, but any advice or assistance that anyone can provide would be appreciated.
Blargh. After all that Googling and reading, what I should have
done was to look earlier in that same routine. The code was already there. Sigh....
Anyway, I should now have the "modname" being printed in the
"vette" log, so that I should be able to figure out which handler is causing the inappropriate bounces. I'll let you know when I find out more.
-- Brad Knowles, <brad@stop.mail-abuse.org>
"Those who would give up essential Liberty, to purchase a little temporary Safety, deserve neither Liberty nor Safety."
-- Benjamin Franklin (1706-1790), reply of the Pennsylvania
Assembly to the Governor, November 11, 1755
SAGE member since 1995. See <http://www.sage.org/> for more info.
At 11:51 AM +0200 2005-10-10, Brad Knowles wrote:
Anyway, I should now have the "modname" being printed in the "vette" log, so that I should be able to figure out which handler is causing the inappropriate bounces. I'll let you know when I find out more.
Okay, I think I found the offending module. Hold.py will syslog
to "vette", if the message is being held. But Moderate.py will not syslog anything -- it passes a held message to Hold.py, but handles rejections and discards itself.
Moreover, Moderate.py uses two different methods of handling
rejections and discards -- subscribers are handled in-line, while non-subscribers are handled through the do_reject() and do_discard() subroutines. And that's the only place the do_reject() and do_discard() subroutines are used.
So, I can't even just drop in a bit of logging in the do_reject()
and do_discard() subroutines, since they aren't both used for subscribers and non-subscribers alike. Sigh....
Can a real Python programmer suggest some changes that would
create syslog messages for rejects and discards for both subscribers and non-subscribers alike, and maybe re-factor the code to re-use the do_discard() and do_reject() subroutines, or do I need to try to fumble around and fix these myself?
-- Brad Knowles, <brad@stop.mail-abuse.org>
"Those who would give up essential Liberty, to purchase a little temporary Safety, deserve neither Liberty nor Safety."
-- Benjamin Franklin (1706-1790), reply of the Pennsylvania
Assembly to the Governor, November 11, 1755
SAGE member since 1995. See <http://www.sage.org/> for more info.
Brad Knowles wrote:
Okay, I think I found the offending module. Hold.py will syslog to "vette", if the message is being held. But Moderate.py will not syslog anything -- it passes a held message to Hold.py, but handles rejections and discards itself.
Moreover, Moderate.py uses two different methods of handling rejections and discards -- subscribers are handled in-line, while non-subscribers are handled through the do_reject() and do_discard() subroutines. And that's the only place the do_reject() and do_discard() subroutines are used.
In a prior post, you indicate that IncomingRunner was detecting a RejectMessage exception. You wrote:
(Your modified) code in IncomingRunner.py currently looks like this:
except Errors.RejectMessage, e: mlist.BounceMessage(msg, msgdata, e) syslog('vette', 'Message bounced, listname: %s, msgid: %s', listname, msg.get('message-id', 'n/a')) return 0
and presumably you were seeing that log message. Thus, we know it is a 'reject' and not a 'discard'.
It looks like there are only 3 paths through Moderate.py that result in a reject. These are:
Post is from a moderated member and the list's member_moderation_action is reject.
Post is from a non-member in reject_these_nonmembers and not in accept or hold _these_nonmembers.
Post is from a non-member not in *_these_nonmembers and generic_nonmember_action is reject.
There are different values for the error message, e, that can distinguish the first case from the second two, but if this isn't enough, I would add some information to the logging statement above.
For example:
syslog('vette', 'Message bounced, listname: %s, \
msgid: %s Subject: %s, Sender: %s, Error: %s', listname, msg.get('message-id', 'n/a'), msg.get('subject', 'no subject'), msg.get_sender(), e)
Also, in all these cases mlist.BounceMessage(msg, msgdata, e) should be attempting to send a reject message to the poster. There should at least be something in the smtp* logs about this.
-- Mark Sapiro <msapiro@value.net> The highway is for gamblers, San Francisco Bay Area, California better use your sense - B. Dylan
At 12:32 PM -0700 2005-10-10, Mark Sapiro wrote:
In a prior post, you indicate that IncomingRunner was detecting a RejectMessage exception.
Correct.
and presumably you were seeing that log message. Thus, we know it is a 'reject' and not a 'discard'.
I am seeing some discards as well, but I've only seen them a
couple of times, and I only saw them at the very beginning of this debugging process -- before I had added the additional information to the syslog output.
I have seen other discards since, but only those two for that list.
Post is from a moderated member and the list's member_moderation_action is reject.
I confirmed that the list's member_moderation_action was to hold.
Post is from a non-member in reject_these_nonmembers and not in accept or hold _these_nonmembers.
There are only a couple of non-members listed in
reject_these_nonmembers, and I don't think that either of them were involved.
Post is from a non-member not in *_these_nonmembers and generic_nonmember_action is reject.
For reasons I cannot comprehend, the generic_nonmember_action was
indeed set to reject, which it should not have been. I know that it used to be set to hold, because I was getting lots of notices about this message from a non-member being held, or that message, etc.... As they came up, I added them to the whitelist, so they wouldn't have to go through the moderation process again.
I know I didn't change the setting for this list, so one of the
other people on the project (and with access to the site admin password) must have.
There are different values for the error message, e, that can distinguish the first case from the second two, but if this isn't enough, I would add some information to the logging statement above.
When I was trying to print the error message "e", I was getting
errors sent to the console about bad formatting or somesuch. I had to remove that in order to get Mailman to run correctly.
For example:
syslog('vette', 'Message bounced, listname: %s, \
msgid: %s Subject: %s, Sender: %s, Error: %s', listname, msg.get('message-id', 'n/a'), msg.get('subject', 'no subject'), msg.get_sender(), e)
This is definitely the sort of error message that I'd like to see
get logged in Moderate.py. Only I'd like to see this get logged for all three code paths, and without having to have the same code replicated three times.
At the very least, if someone changes the
generic_nonmember_action on me again, with this much information being syslogged, I should hopefully be able to detect this situation much earlier and be able to correct it much faster.
Also, in all these cases mlist.BounceMessage(msg, msgdata, e) should be attempting to send a reject message to the poster. There should at least be something in the smtp* logs about this.
Ahh, that's another good idea. Thanks again!
-- Brad Knowles, <brad@stop.mail-abuse.org>
"Those who would give up essential Liberty, to purchase a little temporary Safety, deserve neither Liberty nor Safety."
-- Benjamin Franklin (1706-1790), reply of the Pennsylvania
Assembly to the Governor, November 11, 1755
SAGE member since 1995. See <http://www.sage.org/> for more info.
participants (3)
-
Barry Warsaw
-
Brad Knowles
-
Mark Sapiro