What happens with mailman after a crash
I have a question regarding Mailman's recovery abilities. Let's say Mailman is running and sending out messages to a large list, and the machine crashes or is rebooted. Does Mailman pick up where it stopped, or is the run gone forever?
On Tue, 27 Jan 2004 04:39:24 -0800 "Somuchfun" <somuchfun@atlantismail.com> wrote:
I have a question regarding Mailman's recovery abilities. Let's say Mailman is running and sending out messages to a large list, and the machine crashes or is rebooted. Does Mailman pick up where it stopped, or is the run gone forever?
I've had way too much experience with this lately... :(
Mailman soldiers on just fine. The problems we've run into are with some of mailman's files getting corrupted because they weren't synced to disk when the machine crashed -- the standard problem with any crash. I've had to rebuild a couple of list config databases, and toss out a few corrupted pending pickles. We plan to try the synchronous-write option, do backups (!), and maybe even replace the flakey hardware that's been causing the crashes.
-les
On Tue, 2004-01-27 at 13:26, Les Niles wrote:
Mailman soldiers on just fine. The problems we've run into are with some of mailman's files getting corrupted because they weren't synced to disk when the machine crashed -- the standard problem with any crash. I've had to rebuild a couple of list config databases, and toss out a few corrupted pending pickles. We plan to try the synchronous-write option, do backups (!), and maybe even replace the flakey hardware that's been causing the crashes.
Doesn't the new SYNC_AFTER_WRITE flag address this issue? Here is the doc for it:
# This flag causes Mailman to fsync() its data files after writing and
# flushing its contents.  While this ensures the data is written to disk,
# avoiding data loss, it may be a performance killer.  Note that this flag
# affects both message pickles and MailList config.pck files.
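To make the "fsync() after writing and flushing" wording above concrete, here is a minimal sketch in present-day Python. It is not Mailman's own code: the function name `save_pickle` and the `sync_after_write` parameter are hypothetical, and the write-to-temp-then-rename step is an assumption added so that a crash cannot leave a half-written file under the final name.

```python
import os
import pickle
import tempfile


def save_pickle(path, obj, sync_after_write=True):
    """Write obj to path, optionally forcing it to disk before the rename.

    Hypothetical helper for illustration only; not part of Mailman.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as fp:
            pickle.dump(obj, fp)
            fp.flush()                 # push Python's buffer to the OS
            if sync_after_write:
                os.fsync(fp.fileno())  # ask the OS to push its cache to disk
        os.replace(tmp_path, path)     # swap the new file into place
    except BaseException:
        # Clean up the temporary file if anything went wrong.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

The `os.fsync()` call is the part that can be slow, since it waits for the storage device; skipping it leaves the data in the OS cache, which is exactly what is lost in a hard crash.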
I just happen to remember a note from the Cyrus-IMAP docs that may help you, if you happen to be running linux with ext2. Quote: "LINUX SYSTEMS USING EXT2FS ONLY: Set the user, quota, and partition directories to update synchronously."
http://asg.web.cmu.edu/cyrus/download/imapd/install-configure.html
Quoting John Dennis <jdennis@redhat.com>:
Doesn't the new SYNC_AFTER_WRITE flag address this issue? Here is the doc for it:
On 27 Jan 2004 13:40:04 -0500 John Dennis <jdennis@redhat.com> wrote:
On Tue, 2004-01-27 at 13:26, Les Niles wrote:
Mailman soldiers on just fine. The problems we've run into are with some of mailman's files getting corrupted because they weren't synced to disk when the machine crashed -- the standard problem with any crash. I've had to rebuild a couple of list config databases, and toss out a few corrupted pending pickles. We plan to try the synchronous-write option, do backups (!), and maybe even replace the flakey hardware that's been causing the crashes.
Doesn't the new SYNC_AFTER_WRITE flag address this issue? Here is the doc for it:
# This flag causes Mailman to fsync() its data files after writing and
# flushing its contents.  While this ensures the data is written to disk,
# avoiding data loss, it may be a performance killer.  Note that this flag
# affects both message pickles and MailList config.pck files.
Yes, that's what I meant by "synchronous-write option." We haven't turned it on yet, mostly because of the warning about performance, but partially out of laziness since we need to upgrade MM yet again.
-les
On Tue, 2004-01-27 at 13:40, John Dennis wrote:
Doesn't the new SYNC_AFTER_WRITE flag address this issue? Here is the doc for it:
# This flag causes Mailman to fsync() its data files after writing and
# flushing its contents.  While this ensures the data is written to disk,
# avoiding data loss, it may be a performance killer.  Note that this flag
# affects both message pickles and MailList config.pck files.
Note that this warning /may/ be superstition. When I did some tests a long while ago, I saw something like a 95% hit in performance on ext3/RH9. Since then, I've been told that others have seen much less of a performance hit, and I've also heard that RH9 is particularly prone to performance problems when under heavy I/O.
It would be nice for folks out there to enable SYNC_AFTER_WRITE on heavy traffic sites and report back on performance. Maybe we should enable this option by default.
Also, I now know how to cut the number of files created and unlinked by Mailman in half. Currently, the qrunners create a .msg and .db file for every message in the queues. I can collapse that to one file, and I think I can do this while still maintaining the Python 2.1 compatibility requirement. I think the upgrade procedure will be fairly straightforward, so I'm seriously considering implementing this for Mailman 2.1.5. It's an important change, but it's mostly internal and I think it would be a big enough win to slip it into a bug fix release. There are other advantages, such as getting rid of those pesky "lost data files for filebase" messages.
-Barry
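As a rough illustration of the single-file idea in the message above, the sketch below stores a message and its metadata as two consecutive pickles in one queue file instead of a separate .msg and .db pair. The function names, the `.pck` suffix, and the temp-file naming are assumptions made for this example; it is written in present-day Python and is not the actual Switchboard implementation.

```python
import os
import pickle


def enqueue(queue_dir, filebase, message_text, metadata):
    """Store the message and its metadata together in one queue file."""
    final_path = os.path.join(queue_dir, filebase + ".pck")  # assumed suffix
    tmp_path = final_path + ".tmp"
    with open(tmp_path, "wb") as fp:
        pickle.dump(message_text, fp)   # first record: the message
        pickle.dump(metadata, fp)       # second record: the metadata
        fp.flush()
        os.fsync(fp.fileno())
    os.rename(tmp_path, final_path)     # entry becomes visible in one step
    return final_path


def dequeue(path):
    """Read both records back and remove the queue file."""
    with open(path, "rb") as fp:
        message_text = pickle.load(fp)
        metadata = pickle.load(fp)
    os.unlink(path)
    return message_text, metadata
```

With one file per entry there is no way for the message and its metadata to get separated, which is presumably why the "lost data files for filebase" messages would go away.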
At 10:26 AM -0800 2004/01/27, Les Niles wrote:
I've had to rebuild a couple of list config databases, and toss out a few corrupted pending pickles. We plan to try the synchronous-write option, do backups (!), and maybe even replace the flakey hardware that's been causing the crashes.
You could also change the filesystem where the mailman queue directory is located, so as to remove any asynchronous or delayed writes that are used by the filesystem, above and beyond any mailman SYNC_AFTER_WRITE feature that may or may not be enabled. You could alternatively run on a journaling filesystem, which should also help.
-- Brad Knowles, <brad.knowles@skynet.be>
So what happens when the server is normally rebooted? Does mailman remember where it stopped? Does it start all over again? Or is the mailing just gone?
Mailman-Developers mailing list Mailman-Developers@python.org http://mail.python.org/mailman/listinfo/mailman-developers
On Tue, 2004-01-27 at 17:32, Somuchfun wrote:
So what happens when the server is normally rebooted? Does mailman remember where it stopped? Does it start all over again? Or is the mailing just gone?
If you run "mailmanctl stop" first, as should happen when you change run levels if you've installed the mailman init script, the qrunners will get a Python exception, which they'll catch. That should cause them to re-queue the files and restart where they left off.
-Barry
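The clean-shutdown behaviour described above (catch the exception, re-queue, resume later) can be sketched roughly as follows. This is a simplified model, not Mailman's qrunner code: the `StopRequested` exception, the `run_once` function, and the in-memory list standing in for the on-disk queue are all hypothetical, and the real stop request would arrive from "mailmanctl stop" rather than from the handler itself.

```python
class StopRequested(Exception):
    """Raised inside the runner when a clean shutdown is requested."""


def run_once(queue, handle):
    """Process items in order; on a stop request, put the current item back.

    Returns the list of items that were fully processed.
    """
    processed = []
    while queue:
        item = queue.pop(0)          # dequeue, as a runner would
        try:
            handle(item)
        except StopRequested:
            queue.insert(0, item)    # re-queue the interrupted item
            break
        processed.append(item)
    return processed
```

After a restart, calling `run_once` again on the same queue continues with the item that was interrupted, so nothing is skipped.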
I don't know about your system, but at least on mine, Mailman/Python are only working for a few seconds. The rest of the time the MTA is busy sending out all the mail, and -that- is where you hope no problems are when you reboot. Generally, the current MTAs are pretty good at handling this.
Bob
On Thu, 2004-01-29 at 00:15, Bob Puff@NLE wrote:
I don't know about your system, but at least on mine, Mailman/Python are only working for a few seconds. The rest of the time the MTA is busy sending out all the mail, and -that- is where you hope no problems are when you reboot. Generally, the current MTAs are pretty good at handling this.
Yep.
-Barry
On Tue, 2004-01-27 at 07:39, Somuchfun wrote:
I have a question regarding Mailman's recovery abilities. Let's say Mailman is running and sending out messages to a large list, and the machine crashes or is rebooted. Does Mailman pick up where it stopped, or is the run gone forever?
If Mailman is in the middle of delivering a message and is killed uncleanly, e.g. Python crashes or the machine hard panics, then the current run is lost. If Mailman is stopped cleanly via 'mailmanctl stop', then its current place is remembered and resumed on restart.
I'd like to do better than this, but I think it's infeasible with the current qrunner architecture, since the Switchboard removes the files when they are dequeued for processing. It seems to me the alternatives are to either risk duplicate deliveries for some subset of recipients, or really clobber performance by writing status information out after each successful recipient delivery.
Of course, you'd hope that the window of opportunity for message loss in Mailman is small, if it can hand off all the recipient chunks to the MTA quickly. Then the mail server's guarantees take over.
-Barry
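The trade-off mentioned above, writing status after each successful recipient versus risking duplicates, can be illustrated with a small sketch. It is not how Mailman works (the message above says the current architecture makes this infeasible); the function name, the JSON checkpoint file, and the `send` callback are all hypothetical.

```python
import json
import os


def deliver_with_checkpoint(recipients, send, checkpoint_path):
    """Deliver to each recipient, recording progress after every success."""
    done = set()
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as fp:
            done = set(json.load(fp))     # resume from an earlier run
    for rcpt in recipients:
        if rcpt in done:
            continue                      # already delivered before the crash
        send(rcpt)
        done.add(rcpt)
        # One synchronous write per recipient: this is the performance cost.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as fp:
            json.dump(sorted(done), fp)
            fp.flush()
            os.fsync(fp.fileno())
        os.replace(tmp_path, checkpoint_path)
    if os.path.exists(checkpoint_path):
        os.remove(checkpoint_path)        # run complete, nothing to resume
```

Even in this form a crash between `send(rcpt)` and the checkpoint write would cause that one recipient to receive the message twice on restart, so the scheme narrows the duplicate window rather than closing it.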
participants (7)
- Barry Warsaw
- Bob Puff@NLE
- Brad Knowles
- Jeff Warnica
- John Dennis
- Les Niles
- Somuchfun