[Mailman-Developers] shipstopper bug

Greg Stein gstein@lyra.org
Fri, 08 Jan 1999 18:16:17 -0800


This is a multi-part message in MIME format.

--------------2F2A7A284F92DB72623DFFB
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit

I've reported this before, and I will reiterate it again. Moreover, I
would call it a "shipstopper" for the 1.0 release.

I'm using 1.0b6 here. This has hit me FIVE times in the past 24 hours.
Each time, I have to take manual action to fix things (as reported
before). Basically, shutting down the sendmail listener to stop the mail
loop, clearing the Mailman queue and sendmail queue of the loop-inducing
garbage, and then bringing it back up.

What happens is that somebody types in a request to the Mailman
subscription page with a *local* mail address (i.e. no domain name). The
confirmation email then gets sent. This bounces back to Mailman, but it
doesn't understand that it is a bounce and tries to process the dumb
thing. That fails massively (as seen in the attached mail), and responds
to mailer-daemon with the error. More on this in a bit.

What appears wonky is that Mailman seems to attempt to deliver the
confirmation over and over, nonstop.

Here is a portion of the mail log for RAB25492:

Jan  8 17:18:19 ns1 sendmail[25492]: RAA25492: <ch9517>... User unknown
Jan  8 17:18:19 ns1 sendmail[25492]: lost input channel from localhost
[127.0.0.1]
Jan  8 17:18:19 ns1 sendmail[25492]: RAA25492:
from=<hognews-request@eastsideharley.com>, size=0, class=0, pri=0,
nrcpts=0, proto=SMTP, relay=localhost [127.0.0.1]
Jan  8 17:18:19 ns1 sendmail[25492]: RAA25492: RAB25492: DSN:
<ch9517>... User unknown
Jan  8 17:18:27 ns1 sendmail[25492]: RAB25492:
to=|"/home/mailman/install/mail/wrapper mailcmd hognews",
delay=00:00:08, xdelay=00:00:07, mailer=prog, stat=Sent

Note the "lost input channel". I don't think that is right. The cycle
above repeats at 17:18:32. This goes on until I kill it.

In the above case, the mailer-deamon is responding to Mailman (the last
line). Mailman then processes that mail, and returns a message like the
attached garbage to root. I just deleted 1400 messages from my root
mailbox.

Some more information:

In logs/error, I see the following traceback repeated every 30 minutes:

Jan 08 11:12:07 1999 smtplib: Traceback (innermost last):
smtplib:   File "/home/mailman/install/cron/run_queue", line 31, in ?
smtplib:      OutgoingQueue.processQueue()
smtplib:   File "/home/mailman/install/Mailman/OutgoingQueue.py", line
38, in processQueue
smtplib:      Utils.TrySMTPDelivery(recip,sender,text,full_fname)
smtplib:   File "/home/mailman/install/Mailman/Utils.py", line 201, in
TrySMTPDelivery
smtplib:      con.send(to=recipient,frm=sender,text=text)
smtplib:   File "/home/mailman/install/Mailman/smtplib.py", line 75, in
send
smtplib:      self.getresp()
smtplib:   File "/home/mailman/install/Mailman/smtplib.py", line 147, in
getresp
smtplib:      raise bad, resp
smtplib: smtplib.error_proto :  550 <bzkeffer>... User unknown

I suspect that the "lost input channel" further above is due to a
similar condition. The traceback drops the socket, fails to log
anything, and fails to clear the confirmation email from the outgoing
queue. Mailman processes the error response from mailer-daemon and drops
it into the queue. It runs the queue which processes that response
(flooding the root email box) along with the confirmation (again), which
starts the loop again.

I also believe that I now understand why my machine died back in August
due to this bug, but that hasn't happened to me recently. Because of the
queueing nature of Mailman, there is only one loop occurring at a time.
HOWEVER: if that cron job wakes up and processes the queue, then it
starts a second, simultaneous loop. If enough of those 30-minute
cron-caused loop-creations occurs, then your machine is completely dead
as it thrashes, trying to process all those loops. I've been catching
them relatively soon in this past spate of them (although I did miss one
last night for a long while, causing my loadavg to hit 15... heh. it's
also a DNS server and its response became SLOWWWWWW... which led me to
look into it)

So, I see one or more errors here:

1) a possible traceback not being logged during Q processing
2) said traceback fouling the outgoing mail queue processing
3) should probably disallow domain-less addresses in all places
4) bounce detection needs to recognize the mail that caused the attached
response (read between the lines for the content)

And because this can swamp a box, I would highly recommend it gets fixed
:-)

thx
-g

--
Greg Stein, http://www.lyra.org/

--------------2F2A7A284F92DB72623DFFB
Content-Type: message/rfc822
Content-Transfer-Encoding: 7bit
Content-Disposition: inline

Received: from ns1.lyra.org (root@ns1.lyra.org [208.192.43.10])
	by svpal.svpal.org (8.9.0/8.9.0) with ESMTP id RAA12486
	for <gstein@svpal.org>; Fri, 8 Jan 1999 17:05:39 -0800 (PST)
Received: (from root@localhost)
	by ns1.lyra.org (8.8.5/8.8.5) id RAA25542
	for gstein@lyra.org; Fri, 8 Jan 1999 17:19:53 -0800
Date: Fri, 8 Jan 1999 17:19:53 -0800
From: Greg Stein <gstein@ns1.lyra.org>
Message-Id: <199901090119.RAA25542@ns1.lyra.org>
To: gstein@lyra.org

(
Subject: Mailman results for Hognews
To: mailer-daemon@ns1.lyra.org

**** Subject line ignored: Returned mail: <ch9517>... User unknown

>>>> This is a MIME-encapsulated message
**** this: Command UNKNOWN.

>>>> --RAB25528.915844738/ns1.lyra.org
**** --rab25528.915844738/ns1.lyra.org: Command UNKNOWN.

>>>> The original message was received at Fri, 8 Jan 1999 17:18:56 -0800
**** the: Command UNKNOWN.

>>>> from localhost [127.0.0.1]
**** from: Command UNKNOWN.

>>>> ----- The following addresses had permanent fatal errors -----
**** -----: Command UNKNOWN.

>>>> <ch9517>
**** <ch9517>: Command UNKNOWN.

>>>> ----- Transcript of session follows -----
**** -----: Command UNKNOWN.

>>>> <<< RCPT TO: <ch9517>
**** <<<: Command UNKNOWN.

>>>> 550 <ch9517>... User unknown
**** 550: Command UNKNOWN.

>>>> 421 ns1.lyra.org Lost input channel from localhost [127.0.0.1]
**** 421: Command UNKNOWN.

>>>> ns1.lyra.org Lost input channel from localhost [127.0.0.1]
**** ns1.lyra.org: Command UNKNOWN.

>>>> --RAB25528.915844738/ns1.lyra.org
**** --rab25528.915844738/ns1.lyra.org: Command UNKNOWN.

>>>> Content-Type: message/delivery-status
**** content-type:: Command UNKNOWN.

>>>> Reporting-MTA: dns; ns1.lyra.org
**** reporting-mta:: Command UNKNOWN.

>>>> Received-From-MTA: DNS; localhost
**** received-from-mta:: Command UNKNOWN.

>>>> Arrival-Date: Fri, 8 Jan 1999 17:18:56 -0800
**** arrival-date:: Command UNKNOWN.

>>>> Final-Recipient: RFC822; ch9517@ns1.lyra.org
**** final-recipient:: Command UNKNOWN.

>>>> Action: failed
**** action:: Command UNKNOWN.

>>>> Status: 5.1.1
**** status:: Command UNKNOWN.

>>>> Last-Attempt-Date: Fri, 8 Jan 1999 17:18:58 -0800
**** last-attempt-date:: Command UNKNOWN.

>>>> --RAB25528.915844738/ns1.lyra.org--
**** --rab25528.915844738/ns1.lyra.org--: Command UNKNOWN.


--------------2F2A7A284F92DB72623DFFB--