message posting in a loop with mailman 2.1b1
I just upgraded one of my servers to exim 4 and mailman choked badly on having exim refuse a message because exim did this:
DNS lookup of scruznet.com (MX) gave TRY_AGAIN
scruznet.com in dns_again_means_nonexist? no (option unset)
returning DNS_AGAIN
lookuphost router: defer for champney@scruznet.com
message: host lookup did not complete
----------- end verify ------------
accept: condition test deferred
SMTP>> 451 Temporary local problem - please try later
The problem is that mailman decided that the whole post failed, and started to resend it in a loop.
Apr 15 08:13:56 2002 (17389) post to keskydee from jean-luc@maisiere.com, size=3527, 9 failures
Apr 15 08:17:54 2002 (17389) post to keskydee from jean-luc@maisiere.com, size=3527, 7 failures
Apr 15 08:22:08 2002 (17389) post to keskydee from jean-luc@maisiere.com, size=3527, 8 failures
(...)
This also caused bounce scores against all the users and, obviously, angered the membership. I'm not sure how mailman could deal better with that. Granted, it can be fixed on the exim side, but the problem is not obvious and the looping post is nasty. Since mailman doesn't know how many recipients it delivered to when it gets an error from the MTA (4xx or 5xx), I recommend that mailman move the message to a separate queue dir, log an error (possibly emailing the list owner in the process) and give up on the message. Resending it is only going to piss off the users who get the message each time mm tries.
BTW, on the exim 4 side, I had:
accept hosts = +localadds:+relay_from_hosts
       verify = recipient
I solved the problem by adding this at the beginning of my rcpt ACL:
accept hosts = 127.0.0.1
Marc
Microsoft is to operating systems & security .... .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | Finger marc_f@merlins.org for PGP key
"MM" == Marc MERLIN <marc_news@vasoftware.com> writes:
MM> I just upgraded one of my servers to exim 4 and mailman choked
MM> badly on having exim refuse a message because exim did this:
SMTP> 451 Temporary local problem - please try later
MM> The problem is that mailman decided that the whole post
MM> failed, and started to resend it in a loop.
If I'm reading RFC 2821 correctly, this is the right thing for Mailman to do. A 451 error code means:
"The command was not accepted, and the requested action did not
occur. However, the error condition is temporary and the action
may be requested again. The sender should return to the beginning
of the command sequence (if any)."
So it looks to me like we're correct in assuming that none of the recipients of that chunk got the message. If Exim is doing partial deliveries and still returning 451, that doesn't seem right.
MM> This also caused bounce scores against all the users and
MM> angered the membership obviously.
We could debate whether it's right to increment bounce scores to recipients in a 451'd message. Maybe we should soft-bounce them (increment by 1.0 > score > 0). I could see an argument about just ignoring 451 responses w.r.t. the bounce processor.
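The soft-bounce idea could be sketched roughly as follows. This is only an illustration: the function name, the 0.5 weight, and the `ignore_temporary` switch are assumptions made here, not Mailman's actual bounce processor. The thread only proposes an increment strictly between 0 and 1.0, or ignoring 451 responses altogether.

```python
# Hypothetical weights: the thread only says a soft bounce should add
# something strictly between 0 and 1.0; 0.5 is an arbitrary choice.
HARD_BOUNCE_WEIGHT = 1.0
SOFT_BOUNCE_WEIGHT = 0.5


def bounce_increment(smtp_code, ignore_temporary=False):
    """Return how much one SMTP reply code adds to a member's bounce score."""
    if 500 <= smtp_code <= 599:
        # Permanent failure: full increment.
        return HARD_BOUNCE_WEIGHT
    if 400 <= smtp_code <= 499:
        # Temporary failure: either a fractional increment or nothing at all.
        return 0.0 if ignore_temporary else SOFT_BOUNCE_WEIGHT
    # 2xx/3xx replies are not failures.
    return 0.0


scores = {"champney@scruznet.com": 0.0}
scores["champney@scruznet.com"] += bounce_increment(451)
```

With `ignore_temporary=True` the same call models the alternative of leaving 4xx responses out of bounce processing entirely.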
MM> Since mailman doesn't know how many recipients it delivered
MM> to when it gets an error from the MTA (4xx or 5xx), I
Mailman shouldn't have to know. If it gets a 4xx or 5xx response for a batch of recipients, it has to assume that none of those recipients will get the message. I think this is mandated by the RFC.
MM> recommend that mailman moves the message in a separate queue
MM> dir, logs an error (possibly Emailing the list owner in the
MM> process) and gives up on the message. Resending it is only
MM> going to piss off the users that are getting the message each
MM> time mm tries.
Unless I'm misreading the RFC, you have to blame Exim for this.
-Barry
On Tue, Apr 16, 2002 at 06:33:59PM -0400, Barry A. Warsaw wrote:
> SMTP> 451 Temporary local problem - please try later
> MM> The problem is that mailman decided that the whole post
> MM> failed, and started to resend it in a loop.
> If I'm reading RFC 2821 correctly, this is the right thing for Mailman to do. A 451 error code means:
I'm not saying that mailman is incorrect on the interpretation of the RFC, I'm saying that if mailman feeds an incorrect Email address or something that causes the MTA to reject the mail, it will endlessly spam all the subscribers that are being delivered to every time mailman tries.
This can't be the desired behaviour...
What if mailman gets a 5xx? Does it give up on the message and drop it on the floor?
My point is that in both cases it should log a clear error of what happened, and save the message that triggered the problem somewhere. But you are right that in theory mailman should know who the message was delivered to, and who didn't get it.
> "The command was not accepted, and the requested action did not occur. However, the error condition is temporary and the action may be requested again. The sender should return to the beginning of the command sequence (if any)."
> So it looks to me like we're correct in assuming that none of the recipients of that chunk got the message. If Exim is doing partial deliveries and still returning 451, that doesn't seem right.
Here's what exim does:
220 mail2.merlins.org ESMTP Exim 4.01 #1 Tue, 16 Apr 2002 16:03:46 -0700 - mm1
helo foo
250 mail2.merlins.org Hello root at moremagic.merlins.org [204.80.101.251]
mail from: nobody@merlins.org
250 OK
rcpt to: nobody@uu.net
250 Accepted
rcpt to: champney@scruznet.com
451 Temporary local problem - please try later
Indeed this *shouldn't* have caused mailman to loop, but it sure did. Then I'm afraid the only explanation that comes to mind is that it delivered one block, got a 451 on the next one, the previous block didn't get marked as delivered, and mailman delivered it again to people who already got it (I know, I got 12 copies in my mailbox before I was able to stop it).
Actually I checked more closely, and my system is set up to use VERP, so mailman would only have issued one RCPT per message, i.e. blocks of 1 recipient.
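In the transcript above, one RCPT is accepted and the next is deferred with 451, so the refusal applies to a single recipient. A client can keep track of that per address. Here is a minimal sketch, assuming the refusals arrive in the shape that Python's `smtplib.SMTP.sendmail()` returns (a dict mapping each refused address to a `(code, message)` pair). The helper name `split_refusals` is made up for illustration and is not part of Mailman or smtplib.

```python
def split_refusals(refused):
    """Split an smtplib-style refusal dict into temporary and permanent failures.

    `refused` maps each rejected address to a (code, message) pair.
    """
    temporary, permanent = {}, {}
    for address, (code, message) in refused.items():
        if 400 <= code <= 499:
            # 4xx: worth retrying later, for this recipient only.
            temporary[address] = (code, message)
        else:
            # Anything else in a refusal dict is treated as a 5xx-style
            # permanent failure in this sketch.
            permanent[address] = (code, message)
    return temporary, permanent


# Mirrors the transcript: nobody@uu.net was accepted, so only the deferred
# address appears in the refusal dict.
refused = {
    "champney@scruznet.com": (451, b"Temporary local problem - please try later"),
}
temporary, permanent = split_refusals(refused)
```

Only the addresses in `temporary` would then be candidates for a later retry. Recipients that were accepted never appear in the dict at all.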
Ok, so my exim logs show:
2002-04-15 05:43:31 16x5pS-0004XH-00 => keskydee <keskydee@lists.merlins.org> F=<jean-luc@maisiere.com> R=mm21_main_director T=mm21_transport S=2850
2002-04-15 05:43:34 16x5pW-0004RZ-00 <= keskydee-bounces+jean-luc=maisiere.com@merlins.org H=localhost (moremagic.merlins.org) [127.0.0.1]:49371 I=[127.0.0.1]:25 U=mailman P=esmtp S=1350 id=mailman.9.1018874612.25713.keskydee@lists.merlins.org T="Your message to Keskydee awaits moderator approval" from <keskydee-bounces+jean-luc=maisiere.com@lists.merlins.org> for jean-luc@maisiere.com
(I'm still sleeping; I wake up, approve the message, and it starts getting sent)
2002-04-15 08:06:03 16x83P-0001tW-00 <= keskydee-bounces+sylvie=stanfordalumni.org@merlins.org H=localhost (moremagic.merlins.org) [127.0.0.1]:46865 I=[127.0.0.1]:25 U=mailman P=esmtp S=3893 id=200204151259.OAA09104@serv5.sc3m.net T="[Keskydee] recherche collaboration" from <keskydee-bounces+sylvie=stanfordalumni.org@lists.merlins.org> for sylvie@stanfordalumni.org
2002-04-15 08:06:05 16x83P-0001tW-00 => sylvie@stanfordalumni.org F=<keskydee-bounces+sylvie=stanfordalumni.org@merlins.org> R=lookuphost T=remote_smtp S=4009 H=mx.usa.net [165.212.65.113] C="250 Mail accepted (240gDoPFv0045M01)"
2002-04-15 08:06:05 16x83P-0001tW-00 Completed
2002-04-15 08:06:09 H=localhost (moremagic.merlins.org) [127.0.0.1]:46865 (mailman) F=<keskydee-bounces+loic_fabro=notesetc.com@merlins.org> temporarily rejected RCPT <Loic_Fabro@notesetc.com>: host lookup did not complete
Ahah, first 4xx
2002-04-15 08:06:09 SMTP connection from localhost (moremagic.merlins.org) [127.0.0.1]:46865 closed by QUIT
mailman bails right after that.
2002-04-15 08:06:10 SMTP connection from localhost [127.0.0.1]:41158 (TCP/IP connection count = 1)
and comes back
delivers some other messages and gets another 4xx
2002-04-15 08:06:29 H=localhost (moremagic.merlins.org) [127.0.0.1]:41158 (mailman) F=<keskydee-bounces+champney=scruznet.com@merlins.org> temporarily rejected RCPT <champney@scruznet.com>: host lookup did not complete
2002-04-15 08:06:29 SMTP connection from localhost (moremagic.merlins.org) [127.0.0.1]:41158 closed by QUIT
2002-04-15 08:06:29 SMTP connection from localhost [127.0.0.1]:60022 (TCP/IP connection count = 1)
delivers to more people, and then to me
2002-04-15 08:07:56 16x85E-00058R-00 <= keskydee-bounces+marc=merlins.org@merlins.org H=localhost (moremagic.merlins.org) [127.0.0.1]:42096 I=[127.0.0.1]:25 U=mailman P=esmtp S=3866 id=200204151259.OAA09104@serv5.sc3m.net T="[Keskydee] recherche collaboration" from <keskydee-bounces+marc=merlins.org@lists.merlins.org> for marc<at>merlins.org
Unfortunately, since there are 200+ members, I can't give you a sweet 10 lines of logs with the obvious problem, but I can show you this:
2002-04-15 08:14:27 16x8BX-0005WU-00 <= keskydee-bounces+marc=merlins.org@merlins.org H=localhost (moremagic.merlins.org) [127.0.0.1]:57729 I=[127.0.0.1]:25 U=mailman P=esmtp S=3866 id=200204151259.OAA09104@serv5.sc3m.net T="[Keskydee] recherche collaboration" from <keskydee-bounces+marc=merlins.org@lists.merlins.org> for marc<at>merlins.org
The same message got sent back to me again and again, every 10-15 minutes, until I manually yanked it from the spool.
I know you'd rather have the line of code that does the bad thing, or even a patch, but I can at least tell you that something is amiss. My first report just wasn't good because I didn't research the failure deeply enough.
Marc
"MM" == Marc MERLIN <marc_news@vasoftware.com> writes:
MM> I'm not saying that mailman is incorrect on the interpretation
MM> of the RFC, I'm saying that if mailman feeds an incorrect
MM> Email address or something that causes the MTA to reject the
MM> mail, it will endlessly spam all the subscribers that are
MM> being delivered to every time mailman tries.
Ah, now I understand.
MM> This can't be the desired behaviour...
What? You mean you don't like mailbombing your innocent list members? That's a marketing gimmick I use on people until they come out to see the band and buy at least 3 CDs! You've discovered my backdoor, so now I guess I have to rip it out.
MM> What if mailman gets a 5xx? Does it give up on the message and
MM> drop it on the floor?
When MM gets a 5xx, it records that chunk's recipients as permfailures. For 4xx's it records them as tempfailures. Once delivery has been attempted to all the chunks, it will then process the two types of failures.
For permanent fail recipients, it will periodically lock the affected mailing list and do bounce registration on them.
For temporary fail recipients, it will attempt redelivery until 1) it makes no progress (i.e. the number of undelivered recips from the last attempt is the same as the number from this attempt), and 2) until mm_cfg.DELIVERY_RETRY_PERIOD is elapsed. If these conditions are met, it discards the message (this may not be the right thing to do).
However, in eyeballing the code to write this response, I think I see the bug!
The list of all recipients for this message is kept in the metadata dict, under the key `recips'. When we requeue a message with tempfailures for retry later, I failed to reset the `recips' value to just the list of tempfailed recipients. This would indeed cause the message to be later redelivered to everybody.
I believe the fix is simple. Attached below is an untested patch.
I'm way too tired to test this tonight, but I'll try to craft a test for this tomorrow to make sure it's correct, and if so I'll commit it for 2.1b2, perhaps also to be released tomorrow (I also need to get 2.0.10 out the door).
Thanks, this was a real bug.
-Barry
-------------------- snip snip --------------------
Index: OutgoingRunner.py
===================================================================
RCS file: /cvsroot/mailman/mailman/Mailman/Queue/OutgoingRunner.py,v
retrieving revision 2.14
diff -u -r2.14 OutgoingRunner.py
--- OutgoingRunner.py   10 Apr 2002 04:48:01 -0000      2.14
+++ OutgoingRunner.py   17 Apr 2002 04:09:30 -0000
@@ -101,15 +101,16 @@
             last_recip_count = msgdata.get('last_recip_count', 0)
             deliver_until = msgdata.get('deliver_until', now)
             if len(recips) == last_recip_count:
-                # We didn't make any progress.
+                # We didn't make any progress, so don't attempt delivery any
+                # longer.  BAW: is this the best disposition?
                 if now > deliver_until:
-                    # We won't attempt delivery any longer.
                     return 0
             else:
                 # Keep trying to delivery this for 3 days
                 deliver_until = now + mm_cfg.DELIVERY_RETRY_PERIOD
             msgdata['last_recip_count'] = len(recips)
             msgdata['deliver_until'] = deliver_until
+            msgdata['recips'] = recips
             # Requeue
             return 1
         # We've successfully completed handling of this message
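To see what the one-line fix changes, the requeue logic from the patch can be modeled as a standalone function. This is a simplified sketch, not the real `OutgoingRunner` code: the function name, the `reset_recips` flag (which imitates the pre-patch behaviour when false), the example addresses, and the 3-day value for the retry period (taken from the comment in the patch) are all assumptions made for illustration.

```python
# Assumed value: the patch comment says "3 days"; the real setting lives in
# mm_cfg.DELIVERY_RETRY_PERIOD.
DELIVERY_RETRY_PERIOD = 3 * 24 * 60 * 60


def requeue_decision(msgdata, tempfailures, now, reset_recips=True):
    """Return True to requeue the message for retry, False to give up.

    Modeled on the patched hunk; reset_recips=False imitates the old code,
    which left the full recipient list in msgdata['recips'].
    """
    recips = list(tempfailures)
    last_recip_count = msgdata.get("last_recip_count", 0)
    deliver_until = msgdata.get("deliver_until", now)
    if len(recips) == last_recip_count:
        # No progress since the previous attempt; give up once the
        # retry window has passed.
        if now > deliver_until:
            return False
    else:
        # Progress was made (or this is the first failure): extend the window.
        deliver_until = now + DELIVERY_RETRY_PERIOD
    msgdata["last_recip_count"] = len(recips)
    msgdata["deliver_until"] = deliver_until
    if reset_recips:
        # The fix: retry only the recipients that temporarily failed.
        msgdata["recips"] = recips
    return True


everyone = ["a@example.com", "b@example.com", "c@example.com"]
failed = ["c@example.com"]

buggy = {"recips": list(everyone)}
requeue_decision(buggy, failed, now=0, reset_recips=False)

fixed = {"recips": list(everyone)}
requeue_decision(fixed, failed, now=0)
```

After the calls, `buggy["recips"]` still lists all three addresses, which is the redelivery-to-everybody symptom described in the thread, while `fixed["recips"]` contains only the address that temporarily failed.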
At 16:28 -0700 4/16/2002, Marc MERLIN wrote:
> I'm not saying that mailman is incorrect on the interpretation of the RFC, I'm saying that if mailman feeds an incorrect Email address or something that causes the MTA to reject the mail, it will endlessly spam all the subscribers that are being delivered to every time mailman tries.
> This can't be the desired behaviour...
Nor is it what I see running Mailman with Exim. But I haven't ventured into 2.1b1 yet.
How does Mailman deliver the messages to Exim?
--John
John Baxter jwblist@olympus.net Port Ludlow, WA, USA
On Wed, Apr 17, 2002 at 10:44:16AM -0700, John W Baxter wrote:
> Nor is it what I see running Mailman with Exim. But I haven't ventured into 2.1b1 yet.
> How does Mailman deliver the messages to Exim?
Err, over port 25. I'm not sure I understand the question you probably meant to ask though :)
Marc
participants (3)
- barry@zope.com
- John W Baxter
- Marc MERLIN