[Mailman-Users] Mailman redundancy setup

Thu Oct 25 22:47:09 CEST 2012

David Westlund wrote:
>
>* If we store lock files on an NFS area and one node goes down without removing any locks, will the other node be able to start mailman?

It depends. If you run mailmanctl with the -s or --stale-lock-cleanup
option, it will remove the other master's lock. Note that mailmanctl
--help implies it will do this only if the process with the lock is
not running, but it doesn't really check this.
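
To illustrate what a real liveness check would involve, and why it is not
possible across nodes, here is a minimal sketch. It assumes the lock claim
records a hostname and a PID; the holder_is_alive helper and its arguments
are hypothetical and are not Mailman's own code. It is POSIX-only, since
signal 0 behaves differently on Windows.

```python
import os
import socket


def holder_is_alive(lock_host, lock_pid):
    """Return True/False if the holder is local, None if it cannot be checked.

    Hypothetical helper: lock_host and lock_pid are assumed to have been
    read from the lock claim. This is not Mailman's implementation.
    """
    if lock_host != socket.gethostname():
        # The PID belongs to another machine; it cannot be probed from here.
        return None
    try:
        os.kill(lock_pid, 0)  # signal 0: existence check only, nothing is sent
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


# A lock held by this very process is reported as alive.
local_result = holder_is_alive(socket.gethostname(), os.getpid())
# A lock held on some other node cannot be verified at all.
remote_result = holder_is_alive("other-node.invalid", 12345)
```

For a lock left behind by a crashed node, the answer is always "unknown",
which is why cleaning it up from the surviving node is a judgment call.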

>* According to http://wiki.list.org/pages/viewpage.action?pageId=4030621, qrunner does not use locks. However, I see a file named "master-qrunner" in the locks directory. Suppose that our master node went down without removing any lock files, would the master-qrunner file cause the qrunner on the other node to not start?

The qrunner processes do not use locks but the master (mailmanctl)
does. See answer above.

>* On the page http://wiki.list.org/pages/viewpage.action?pageId=4030621 they mention that "If you set up /usr/local/mailman/qfiles to be shared across NFS and don't set up dedicated slices for each group of queue runners, you will be SERIOUSLY SCREWED." In what way?

If you have two qrunners processing the same (or overlapping) slices of
the same queue, they step on each other's toes. Both runners make a
list of the candidate entries in their slice of the queue. One runner
starts to process an entry and renames it from *.pck to *.bak. The
second runner attempts to process the same entry and finds it missing.
This causes the second runner to log a misleading error and attempt to
remove and perhaps preserve the *.bak file the first runner is
processing. One or the other runner will be unable to remove the *.bak
file and will log another error.

Further, if the second runner removes the *.bak and the first runner
dies for some reason, there will be no recovery file and the message
will be lost.
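
The collision can be reproduced with a small sketch. The claim function
and the entry name below are made up for illustration; only the *.pck to
*.bak rename comes from the description above.

```python
import os
import tempfile


def claim(entry_path):
    """Try to claim a queue entry by renaming *.pck to *.bak.

    Returns the backup path, or None if the entry is already gone.
    """
    backup = entry_path[:-len(".pck")] + ".bak"
    try:
        os.rename(entry_path, backup)
    except FileNotFoundError:
        return None
    return backup


with tempfile.TemporaryDirectory() as qdir:
    entry = os.path.join(qdir, "msg0001.pck")  # hypothetical entry name
    with open(entry, "wb") as f:
        f.write(b"queued message")

    # Both runners list the same slice before either starts processing.
    runner_a_todo = [entry]
    runner_b_todo = [entry]

    a_result = claim(runner_a_todo[0])  # first runner wins the rename
    b_result = claim(runner_b_todo[0])  # second runner finds the entry missing
```

Here the second runner simply gets None back. In the situation described
above it goes on to touch the first runner's *.bak file, which is where
the lost-message risk comes from.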

>* Say that we store locks locally, store data on an NFS node and accidentally start two instances of mailman on two different machines. Do we risk data corruption, or is the worst thing that can happen that changes are overwritten?

You will have all the issues described in the previous answer due to
multiple qrunners processing the same slices of the same queues.

In addition, you will potentially have lists being updated concurrently
by the separate hosts. I think that the worst that can happen is that
one host's changes will be lost. The two hosts will write their changes
to separate temporary files. The temp files have hostname and PID as
part of their names. Thus, the temp files should each be good although
they will not contain the other host's changes. Then each process does
the sequence 1) remove config.pck.last, 2) rename config.pck to
config.pck.last, 3) rename config.pck.tmp.xxx to config.pck.
Ordinarily, the changes from the host that does step 3 first will be
lost.
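
A minimal sketch of that sequence, run by two hosts one after the other.
The write_tmp and rotate helpers are hypothetical, the exact temp file
name format is an assumption, and tolerating missing files in steps 1 and
2 is a simplification of mine.

```python
import os
import tempfile


def write_tmp(cfg, host, pid, data):
    """Write one host's view of the list to its own temp file."""
    tmp = "%s.tmp.%s.%d" % (cfg, host, pid)  # assumed name format
    with open(tmp, "wb") as f:
        f.write(data)
    return tmp


def rotate(cfg, tmp):
    """Steps 1-3: drop .last, move the live file to .last, install tmp."""
    last = cfg + ".last"
    try:
        os.remove(last)        # step 1
    except FileNotFoundError:
        pass
    try:
        os.rename(cfg, last)   # step 2
    except FileNotFoundError:
        pass
    os.rename(tmp, cfg)        # step 3


with tempfile.TemporaryDirectory() as listdir:
    cfg = os.path.join(listdir, "config.pck")
    with open(cfg, "wb") as f:
        f.write(b"original")

    # Both hosts start from the same state and prepare their own changes.
    tmp_a = write_tmp(cfg, "hosta", 101, b"original + change from A")
    tmp_b = write_tmp(cfg, "hostb", 202, b"original + change from B")

    rotate(cfg, tmp_a)  # host A finishes step 3 first
    rotate(cfg, tmp_b)  # host B finishes afterwards

    with open(cfg, "rb") as f:
        live = f.read()
    with open(cfg + ".last", "rb") as f:
        previous = f.read()
```

The live config.pck ends up with only host B's change. Host A's version
survives only as config.pck.last until the next save removes it.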

It would be possible for one host to try to instantiate a list and read
the config.pck in between steps 2 and 3 by the other host when the
config.pck isn't there. This should just cause it to fall back to the
config.pck.last, but again will result in the changes in the other
host's temp file being lost.
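
The fallback can be pictured like this. This is a sketch only: load_config
is a hypothetical helper, not Mailman's loader, and the pickled dict is
placeholder data.

```python
import os
import pickle
import tempfile


def load_config(listdir):
    """Load config.pck, falling back to config.pck.last if it is absent."""
    for name in ("config.pck", "config.pck.last"):
        path = os.path.join(listdir, name)
        try:
            with open(path, "rb") as f:
                return name, pickle.load(f)
        except FileNotFoundError:
            continue
    raise FileNotFoundError("no usable config in %s" % listdir)


with tempfile.TemporaryDirectory() as listdir:
    # State between steps 2 and 3 on the other host: only .last exists.
    with open(os.path.join(listdir, "config.pck.last"), "wb") as f:
        pickle.dump({"subject_prefix": "[old]"}, f)

    used_file, config = load_config(listdir)
```

The reader gets the older state from config.pck.last and never sees the
contents of the other host's temp file.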

>* Is there some documentation that I have not found about how locks are used in Mailman?

Have you seen <http://wiki.list.org/x/noA9> which describes what lock
files look like?

There are only two kinds of locks in Mailman: the master qrunner
(mailmanctl) lock, which is intended to prevent Mailman from being
started twice but is defeated by the -s|--stale-lock-cleanup
option, and list locks. List locks prevent concurrent updates to list
objects and a list's archives.

-- 
Mark Sapiro <mark at msapiro.net>        The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan