I've reported this before, and I will reiterate it. Moreover, I would call it a "shipstopper" for the 1.0 release.
I'm using 1.0b6 here. This has hit me FIVE times in the past 24 hours. Each time, I have to take manual action to fix things (as reported before): shutting down the sendmail listener to stop the mail loop, clearing the Mailman queue and sendmail queue of the loop-inducing garbage, and then bringing it back up.
What happens is that somebody types in a request to the Mailman subscription page with a *local* mail address (i.e. no domain name). The confirmation email then gets sent. This bounces back to Mailman, but it doesn't understand that it is a bounce and tries to process the dumb thing. That fails massively (as seen in the attached mail), and responds to mailer-daemon with the error. More on this in a bit.
What appears wonky is that Mailman seems to attempt to deliver the confirmation over and over, nonstop.
Here is a portion of the mail log for RAB25492:
Jan 8 17:18:19 ns1 sendmail[25492]: RAA25492: <ch9517>... User unknown
Jan 8 17:18:19 ns1 sendmail[25492]: lost input channel from localhost [127.0.0.1]
Jan 8 17:18:19 ns1 sendmail[25492]: RAA25492: from=<hognews-request@eastsideharley.com>, size=0, class=0, pri=0, nrcpts=0, proto=SMTP, relay=localhost [127.0.0.1]
Jan 8 17:18:19 ns1 sendmail[25492]: RAA25492: RAB25492: DSN: <ch9517>... User unknown
Jan 8 17:18:27 ns1 sendmail[25492]: RAB25492: to=|"/home/mailman/install/mail/wrapper mailcmd hognews", delay=00:00:08, xdelay=00:00:07, mailer=prog, stat=Sent
Note the "lost input channel". I don't think that is right. The cycle above repeats at 17:18:32. This goes on until I kill it.
In the above case, the mailer-daemon is responding to Mailman (the last line). Mailman then processes that mail, and returns a message like the attached garbage to root. I just deleted 1400 messages from my root mailbox.
Some more information:
In logs/error, I see the following traceback repeated every 30 minutes:
Jan 08 11:12:07 1999 smtplib: Traceback (innermost last):
smtplib: File "/home/mailman/install/cron/run_queue", line 31, in ?
smtplib: OutgoingQueue.processQueue()
smtplib: File "/home/mailman/install/Mailman/OutgoingQueue.py", line 38, in processQueue
smtplib: Utils.TrySMTPDelivery(recip,sender,text,full_fname)
smtplib: File "/home/mailman/install/Mailman/Utils.py", line 201, in TrySMTPDelivery
smtplib: con.send(to=recipient,frm=sender,text=text)
smtplib: File "/home/mailman/install/Mailman/smtplib.py", line 75, in send
smtplib: self.getresp()
smtplib: File "/home/mailman/install/Mailman/smtplib.py", line 147, in getresp
smtplib: raise bad, resp
smtplib: smtplib.error_proto : 550 <bzkeffer>... User unknown
I suspect that the "lost input channel" further above is due to a similar condition. The traceback drops the socket, fails to log anything, and fails to clear the confirmation email from the outgoing queue. Mailman processes the error response from mailer-daemon and drops it into the queue. It runs the queue which processes that response (flooding the root email box) along with the confirmation (again), which starts the loop again.
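To illustrate the kind of queue handling that would avoid this, here is a minimal sketch in present-day Python. It is not Mailman's code: `process_queue`, the `(recipient, sender, text)` entry layout, and the `deliver` and `log` callables are all hypothetical, and it uses the standard library's `smtplib` exceptions rather than the `smtplib.py` bundled with Mailman. The idea is that a permanent 5xx rejection gets logged and the message is dropped from the queue, instead of an uncaught exception leaving it there to be retried forever.

```python
import smtplib


def process_queue(entries, deliver, log):
    """Try each queued message once and return the entries to retry later.

    entries: list of (recipient, sender, text) tuples (hypothetical layout).
    deliver: callable that sends one message, raising smtplib errors on failure.
    log:     callable that records one line of text.
    """
    remaining = []
    for entry in entries:
        recipient, sender, text = entry
        try:
            deliver(recipient, sender, text)
        except smtplib.SMTPResponseException as exc:
            log("delivery to %s failed: %s %s"
                % (recipient, exc.smtp_code, exc.smtp_error))
            # 4xx replies are temporary, so keep the message for a later run.
            # Anything else (e.g. 550 User unknown) is treated as permanent
            # and the message is dropped rather than retried.
            if 400 <= exc.smtp_code < 500:
                remaining.append(entry)
        except (smtplib.SMTPException, OSError) as exc:
            # Connection-level trouble: log it and retry on the next run.
            log("delivery to %s failed: %s" % (recipient, exc))
            remaining.append(entry)
    return remaining
```

One failing recipient no longer aborts the whole run, and every failure leaves a log line.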
I also believe that I now understand why my machine died back in August due to this bug, but that hasn't happened to me recently. Because of the queueing nature of Mailman, there is only one loop occurring at a time. HOWEVER: if that cron job wakes up and processes the queue, then it starts a second, simultaneous loop. If enough of those 30-minute cron-caused loop-creations occur, then your machine is completely dead as it thrashes, trying to process all those loops. I've been catching them relatively soon in this past spate of them (although I did miss one last night for a long while, causing my loadavg to hit 15... heh. it's also a DNS server and its response became SLOWWWWWW... which led me to look into it)
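One common way to keep a cron-started queue run from piling on top of one that is still going is a lock file. The following is only a sketch of that general technique in present-day Python, not something Mailman is known to do here; `run_exclusively` and the lock path are hypothetical names, and a stale lock left behind by a crashed run is not handled.

```python
import os


def run_exclusively(lock_path, job):
    """Run job() only if no other runner holds lock_path.

    Returns True if job() ran, False if another runner already held the lock.
    """
    try:
        # O_EXCL makes the open fail if the lock file already exists,
        # so only one process can create it.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    try:
        # Record our pid so an admin can see who holds the lock.
        os.write(fd, str(os.getpid()).encode())
        job()
        return True
    finally:
        os.close(fd)
        os.unlink(lock_path)
```

With this in place, a second run that starts while the first is still looping would exit immediately instead of starting another loop.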
So, I see one or more errors here:
1) a possible traceback not being logged during Q processing
2) said traceback fouling the outgoing mail queue processing
3) should probably disallow domain-less addresses in all places
4) bounce detection needs to recognize the mail that caused the attached response (read between the lines for the content)
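For the bounce-detection point, a rough heuristic might look like the sketch below. It is written in present-day Python with the standard `email` package and is not Mailman's bounce detector; `looks_like_bounce` is a hypothetical name, and the three checks (null return path, a MAILER-DAEMON sender, a `multipart/report` body) are assumptions about what a sendmail DSN typically carries, not something taken from the attached mail.

```python
from email import message_from_string
from email.utils import parseaddr


def looks_like_bounce(raw_message):
    """Heuristically decide whether raw_message is a delivery failure notice."""
    msg = message_from_string(raw_message)

    # Delivery status notifications are sent with a null return path.
    if msg.get("Return-Path", "").strip() == "<>":
        return True

    # Mail from the MTA itself usually comes from MAILER-DAEMON.
    _, addr = parseaddr(msg.get("From", ""))
    if addr.split("@", 1)[0].lower() == "mailer-daemon":
        return True

    # Standard DSNs are sent as multipart/report.
    return msg.get_content_type() == "multipart/report"
```

A message flagged this way would be routed to bounce handling (or simply discarded) rather than parsed as a mail command, so no error reply goes back to mailer-daemon.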
And because this can swamp a box, I would highly recommend it gets fixed :-)
thx -g
-- Greg Stein, http://www.lyra.org/
"GS" == Greg Stein <gstein@lyra.org> writes:
GS> What happens is that somebody types in a request to the
GS> Mailman subscription page with a *local* mail address (i.e. no
GS> domain name). The confirmation email then gets sent. This
GS> bounces back to Mailman, but it doesn't understand that it is
GS> a bounce and tries to process the dumb thing. That fails
GS> massively (as seen in the attached mail), and responds to
GS> mailer-daemon with the error. More on this in a bit.
GS> 3) should probably disallow domain-less addresses in all
GS> places
Greg,
It should now (with 1.0b7+) be impossible to subscribe a domainless email address. I've verified this with the Membership Management interface, a normal user's use of the subscribe page, attempting to subscribe via email directly, and using bin/add_members.
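As an illustration only, a domain check of this sort can be sketched as follows in present-day Python. This is not the check that went into 1.0b7; `is_qualified_address` is a hypothetical name, and requiring at least one dot in the domain is an assumption.

```python
def is_qualified_address(addr):
    """Return True if addr has a local part and a dotted domain."""
    local, sep, domain = addr.strip().rpartition("@")
    if not sep or not local or not domain:
        # No "@" at all (e.g. "ch9517"), or one side of it is empty.
        return False
    # Require at least two non-empty labels, e.g. "lyra.org".
    labels = domain.split(".")
    return len(labels) >= 2 and all(labels)
```

Applied to the addresses from the logs, `is_qualified_address("ch9517")` and `is_qualified_address("bzkeffer")` are both false, so neither request would get as far as sending a confirmation.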
If this means that from now on, no one will encounter the problem, I'm tempted not to investigate further.
My question to the developers is, how important do you think it is to fix this problem for the early adopters? You guys can `just' delete the unqualified addresses and force those people to re-subscribe using a now valid address. Would this be a huge burden?
Of course, if anybody's seen this happen for qualified addresses, then obviously we need to debug it.
-Barry