Mailman 3 stuck bounce-events write - Mailman-Users

21 Nov 2011


      Greetings.
Last week, we ran into a rather odd event (first time we've seen it in a
decade or so of Mailman ops), where a bounced message kept writing to its
/var/mailman/data/bounce-events-PID.pck file, until it filled the disk --
it wasn't a series of bounces (same delivery times on every received
header), it was the same bounce written over and over again (about 5000
times), unfortunately, that bounce contained a 7 MB base64 encoded
attachment.
I thought I'd toss the details out here, to see if anyone had a thought on
what caused this, or (other than implementing a disk watching cron job,
which we already have), how to avoid it in the future ... though it
certainly seems like a fluke.
Anyhow, the message was first sent to the list at 16:41:35 (after manual
list owner approval of the moderated posting), and was posted normally at:
/var/mailman/logs/post:Nov 16 16:41:35 2011 (1560) post to [somelist]-news
from [someone]@my.org size=7705563,
message-id=,
success
The message had a PDF attachment, and was from the list owner. The list had
about 120 members, and most of the members (there were some other
undeliverable addresses) got the message. The list owner got the message.
There appeared to be nothing bogus about the attachment.
The first bounce seemed normal:
Delivery failed for the following reason:
Server entsrmbx01.teledyne.com[74.201.16.49] failed with: 550 #5.1.0
Address rejected <someone>@teledyne.com
This has been a permanent failure.  No further delivery attempts will be
made.
... and then the first bounce was written to the
/var/mailman/data/bounce-events pck file at:
Received from localhost (localhost [127.0.0.1])
by [myserver] (Postfix) with ESMTP id 3272D601E21
for <[somelist]-news-bounces@myserver>; Wed, 16 Nov 2011 16:41:55
-0500 (EST)
... but was written another 5000 times with it's PDF attachment.
... at 17:43 local, the disk filled with:
Nov 16 17:43:13 [hostname] postfix/smtpd[27239]: NOQUEUE: reject: MAIL from
localhost[127.0.0.1]: 452 4.3.1 Insufficient system storage; proto=ESMTP
helo=<localhost>
Nov 16 17:43:13 [hostname] postfix/smtpd[27239]: warning: not enough free
space in mail queue: 36216832 bytes < 1.5*message size limit
Oddly, the bounce-events file's last write was considerably earlier:
-rw-rw-r-- 1 mailman mailman 37867700224 2011-11-16 16:52
bounce-events-01556.pck
The only questionable log entries I can find after/around the message
posting is for an abort of the Archrunner and Bouncerunner before the disk
filled:
Nov 16 17:06:08 2011 (1533) Master qrunner detected subprocess exit
(pid: 1555, sig: 9, sts: None, class: ArchRunner, slice: 1/1) [restarting]
Nov 16 17:06:14 2011 (27568) ArchRunner qrunner started.
Nov 16 17:06:18 2011 (1533) Master qrunner detected subprocess exit
(pid: 1556, sig: 9, sts: None, class: BounceRunner, slice: 1/1) [restarting]
Nov 16 17:06:20 2011 (27573) BounceRunner qrunner started.
Since the [somelist]-news list in question doesn't have archiving enabled
(never has), so I the ArchRunner restart. There were no other qrunner
restarts since 6:33 that morning, and the only later ones were when admin
staff went in to clear things up.
So chronologically:

16:41 first bounce occurs and is written to bounce-events by PID 1556
16:52 PID 1556 makes (according to the file system) its last write to
bounce-events-1556.pck
17:02 ArchRunner and BounceRunner get restart signals, PID 27573 is the
new BounceRunner
17:43 the disk fills to the point postfix starts complaining

... that doesn't add up unless one presumes PID 1556 hung around and kept
writing bounce-events-1556.pck from 17:02 to 17:43 (and probably beyond).
Does the above add up to anything for anyone else? Is there something else
that should be checked?
Thanks for any insight.
--
Richard Haas rhaas@rhaas.us
GnuPG public key ID: 1CB7F0E2

stuck bounce-events write

Richard Haas

-- Richard Haas rhaas@rhaas.us GnuPG public key ID: 1CB7F0E2

tags

participants (3)