[Mailman-Users] stuck bounce-events write

Mark Sapiro mark at msapiro.net
Tue Nov 22 08:54:41 CET 2011


Richard Haas wrote:
>
>Last week, we ran into a rather odd event (first time we've seen it in a
>decade or so of Mailman ops), where a bounced message kept writing to its
>/var/mailman/data/bounce-events-PID.pck file, until it filled the disk [...]
>The first bounce seemed normal:
>
>Delivery failed for the following reason:
>Server entsrmbx01.teledyne.com[74.201.16.49] failed with: 550 #5.1.0
>Address rejected <someone>@teledyne.com
>This has been a permanent failure.  No further delivery attempts will be
>made.
>
>
>... and then the first bounce was written to the
>/var/mailman/data/bounce-events pck file at:
>
>Received from localhost (localhost [127.0.0.1])
>       by [myserver] (Postfix) with ESMTP id 3272D601E21
>       for <[somelist]-news-bounces at myserver>; Wed, 16 Nov 2011 16:41:55
>-0500 (EST)
>
>
>... but was written another 5000 times with it's PDF attachment.
>
>... at 17:43 local, the disk filled with:
>
>Nov 16 17:43:13 [hostname] postfix/smtpd[27239]: NOQUEUE: reject: MAIL from
>localhost[127.0.0.1]: 452 4.3.1 Insufficient system storage; proto=ESMTP
>helo=<localhost>
>Nov 16 17:43:13 [hostname] postfix/smtpd[27239]: warning: not enough free
>space in mail queue: 36216832 bytes < 1.5*message size limit


So is it Postfix that is repeatedly delivering this bounce message to
Mailman, and Mailman is just trying to deal with it? If so, the
underlying issue is with Postfix, not Mailman.


>Oddly, the bounce-events file's last write was considerably earlier:
>
>-rw-rw-r-- 1 mailman mailman 37867700224 2011-11-16 16:52
>bounce-events-01556.pck


This is an orphaned bounce-events file that was left behind when
BounceRunner PID 1556 died - see below.


>The only questionable log entries I can find after/around the message
>posting is for an abort of the Archrunner and Bouncerunner before the disk
>filled:
>
>Nov 16 17:06:08 2011 (1533) Master qrunner detected subprocess exit
>(pid: 1555, sig: 9, sts: None, class: ArchRunner, slice: 1/1) [restarting]
>Nov 16 17:06:14 2011 (27568) ArchRunner qrunner started.
>Nov 16 17:06:18 2011 (1533) Master qrunner detected subprocess exit
>(pid: 1556, sig: 9, sts: None, class: BounceRunner, slice: 1/1) [restarting]
>Nov 16 17:06:20 2011 (27573) BounceRunner qrunner started.


BounceRunner PID 1556 died at 17:06 possibly while processing the
bounces in qfiles/bounces or while processing the bounce-events file.
This is within the normal 15 minute window for processing
bounce-events.

Where there any error log entries associated with the deaths of
BounceRunner or ArchRunner?


>Since the [somelist]-news list in question doesn't have archiving enabled
>(never has), so I the ArchRunner restart. There were no other qrunner
>restarts since 6:33 that morning, and the only later ones were when admin
>staff went in to clear things up.
>
>So chronologically:
>
>- 16:41 first bounce occurs and is written to bounce-events by PID 1556
>- 16:52 PID 1556 makes (according to the file system) its last write to
>bounce-events-1556.pck
>- 17:02 ArchRunner and BounceRunner get restart signals, PID 27573 is the
>new BounceRunner
>- 17:43 the disk fills to the point postfix starts complaining
>
>
>... that doesn't add up unless one presumes PID 1556 hung around and kept
>writing bounce-events-1556.pck from 17:02 to 17:43 (and probably beyond).


PID 1556 was gone at 17:06:18. It is not clear whether the repeated
bounce events were due to BounceRunner in some kind of loop or
BounceRunner actually receiving bounces from Postfix.

Are there Postfix log entries for the delivery of 5000 bounces to
Mailman? If so, it would seem Postfix is the culprit. If not, then it
would seem to be BounceRunner. I don't have a clue as to why
BounceRunner would do this.

-- 
Mark Sapiro <mark at msapiro.net>        The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan



More information about the Mailman-Users mailing list