Mailman redundancy setup
Hi
I am investigating a redundant setup of a Mailman installation. We do not care about load balancing, only failover. I have read through everything I could find about this, including the following article in the FAQ: http://wiki.list.org/pages/viewpage.action?pageId=4030621
I am planning to solve this with an ordinary active-passive pair that shares an IP address between the hosts. My idea is to have Mailman installed on both servers, but to point them at an NFS area for storing the data that needs to be available on all nodes in the setup.
The big difference between failover and load balancing (described in the FAQ post that I linked to) is of course that whenever there is a failover, there is a good chance that the other Mailman instance exited abruptly, leaving lock files and temporary data behind. This is an argument for storing lock files locally. On the other hand, we do not want data to get corrupted even if both hosts end up running simultaneously. I have some questions; perhaps someone here has an answer:
If we store lock files on an NFS area and one node goes down without removing any locks, will the other node be able to start mailman?
According to http://wiki.list.org/pages/viewpage.action?pageId=4030621, qrunner does not use locks. However, I see a file named "master-qrunner" in the locks directory. Suppose that our master node went down without removing any lock files, would the master-qrunner file cause the qrunner on the other node to not start?
On the page http://wiki.list.org/pages/viewpage.action?pageId=4030621 they mention that "If you set up /usr/local/mailman/qfiles to be shared across NFS and don't set up dedicated slices for each group of queue runners, you will be SERIOUSLY SCREWED." In what way?
Say that we store locks locally, store data on an NFS node and accidentally start two instances of mailman on two different machines. Do we risk data corruption, or is the worst thing that can happen that changes are overwritten?
Is there some documentation that I have not found about how locks are used in Mailman?
BR, David Westlund
David Westlund
- If we store lock files on an NFS area and one node goes down without removing any locks, will the other node be able to start mailman?
It depends. If you run mailmanctl with the -s or --stale-lock-cleanup option, it will remove the other master's lock. Note that mailmanctl --help implies it will do this only if the process with the lock is not running, but it doesn't really check this.
- According to http://wiki.list.org/pages/viewpage.action?pageId=4030621, qrunner does not use locks. However, I see a file named "master-qrunner" in the locks directory. Suppose that our master node went down without removing any lock files, would the master-qrunner file cause the qrunner on the other node to not start?
The qrunner processes do not use locks but the master (mailmanctl) does. See answer above.
- On the page http://wiki.list.org/pages/viewpage.action?pageId=4030621 they mention that "If you set up /usr/local/mailman/qfiles to be shared across NFS and don't set up dedicated slices for each group of queue runners, you will be SERIOUSLY SCREWED." In what way?
If you have two qrunners processing the same (or overlapping) slices of the same queue, they step on each other's toes. Both runners make a list of the candidate entries in their slice of the queue. One runner starts to process an entry and renames it from *.pck to *.bak. The second runner attempts to process the same entry and finds it missing. This causes the second runner to log a misleading error and attempt to remove and perhaps preserve the *.bak file the first runner is processing. One or the other runner will be unable to remove the *.bak file and will log another error.
Further, if the second runner removes the *.bak and the first runner dies for some reason, there will be no recovery file and the message will be lost.
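To make the race concrete, here is a minimal sketch that imitates two runners working on the same slice. It is not Mailman's qrunner code: the helper names (list_candidates, claim, simulate_overlap) and the entry name msg1 are made up for illustration, and the two "runners" are simply run one after the other in a temporary directory.

```python
import os
import tempfile


def list_candidates(qdir):
    """Return the base names of the *.pck entries currently in the queue."""
    return sorted(f[:-4] for f in os.listdir(qdir) if f.endswith(".pck"))


def claim(qdir, base):
    """Try to claim an entry by renaming base.pck to base.bak.

    Returns True on success, False if the .pck file is already gone
    (which is what the second runner runs into).
    """
    src = os.path.join(qdir, base + ".pck")
    dst = os.path.join(qdir, base + ".bak")
    try:
        os.rename(src, dst)
        return True
    except FileNotFoundError:
        return False


def simulate_overlap():
    """Both runners list the queue first, then each tries to claim every entry."""
    with tempfile.TemporaryDirectory() as qdir:
        # One hypothetical queue entry.
        open(os.path.join(qdir, "msg1.pck"), "w").close()
        seen_by_a = list_candidates(qdir)
        seen_by_b = list_candidates(qdir)  # same list: the slices overlap
        results_a = [claim(qdir, base) for base in seen_by_a]
        results_b = [claim(qdir, base) for base in seen_by_b]
        return results_a, results_b


if __name__ == "__main__":
    print(simulate_overlap())
```

The second runner still has the entry in its candidate list but can no longer find the *.pck file, which is the point where the misleading error and the fight over the *.bak file begin.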
- Say that we store locks locally, store data on an NFS node and accidentally start two instances of mailman on two different machines. Do we risk data corruption, or is the worst thing that can happen that changes are overwritten?
You will have all the issues described in the previous answer due to multiple qrunners processing the same slices of the same queues.
In addition, you will potentially have lists being updated concurrently by the separate hosts. I think that the worst that can happen is that one host's changes will be lost. The two hosts will write their changes to separate temporary files. The temp files have the hostname and PID as part of their names, so each temp file should be good, although it will not contain the other host's changes. Then each process does the sequence 1) remove config.pck.last, 2) rename config.pck to config.pck.last, 3) rename config.pck.tmp.xxx to config.pck. Ordinarily, the changes from the host that does step 3 first will be lost.
It would be possible for one host to try to instantiate a list and read the config.pck in between steps 2 and 3 by the other host when the config.pck isn't there. This should just cause it to fall back to the config.pck.last, but again will result in the changes in the other host's temp file being lost.
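The save sequence can be sketched as follows, as a rough illustration of why the host that finishes step 3 first loses its changes. This only follows the three steps as described here, not Mailman's actual save code; the function names, the exact temp file name format, the host names and the file contents are all assumptions made for the example.

```python
import os
import tempfile


def write_temp(listdir, host, pid, data):
    """Write one host's version of the list to its own temp file."""
    tmp = os.path.join(listdir, "config.pck.tmp.%s.%d" % (host, pid))
    with open(tmp, "w") as f:
        f.write(data)
    return tmp


def install(listdir, tmp):
    """Run the three-step sequence for one host."""
    cfg = os.path.join(listdir, "config.pck")
    last = cfg + ".last"
    try:
        os.unlink(last)          # 1) remove config.pck.last
    except FileNotFoundError:
        pass
    try:
        os.rename(cfg, last)     # 2) rename config.pck to config.pck.last
    except FileNotFoundError:
        pass
    os.rename(tmp, cfg)          # 3) rename the temp file to config.pck


def simulate_two_hosts():
    with tempfile.TemporaryDirectory() as listdir:
        with open(os.path.join(listdir, "config.pck"), "w") as f:
            f.write("base")
        # Both hosts start from the same state and add their own change.
        tmp_a = write_temp(listdir, "hostA", 100, "base+A")
        tmp_b = write_temp(listdir, "hostB", 200, "base+B")
        install(listdir, tmp_a)  # host A does step 3 first
        install(listdir, tmp_b)  # host B follows and replaces A's file
        with open(os.path.join(listdir, "config.pck")) as f:
            current = f.read()
        with open(os.path.join(listdir, "config.pck.last")) as f:
            previous = f.read()
        return current, previous


if __name__ == "__main__":
    print(simulate_two_hosts())
```

In this run the live config.pck ends up with only host B's change, while host A's version survives only as config.pck.last until the next save removes it.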
- Is there some documentation that I have not found about how locks are used in Mailman?
Have you seen <http://wiki.list.org/x/noA9> which describes what lock files look like?
There are only two kinds of locks in Mailman: the master qrunner (mailmanctl) lock, which is intended to prevent Mailman from being started twice but which is defeated by the -s|--stale-lock-cleanup option, and list locks. List locks prevent concurrent updates to list objects and a list's archives.
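For a failover script that wants to know which host left a master lock behind before resorting to -s, something along these lines might help. It is only a sketch and rests on an assumption you should check against your own locks directory and the wiki page above: that the lock file's content names a claim file of the form <lockname>.<hostname>.<pid>. The function names and the example path are hypothetical.

```python
import os


def parse_lock_claim(content):
    """Split '<path>/<lockname>.<hostname>.<pid>' into (hostname, pid).

    Assumes the lock name itself contains no dots, so everything between
    the first and the last dot-separated field is the hostname.
    """
    name = os.path.basename(content.strip())
    parts = name.split(".")
    if len(parts) < 3:
        raise ValueError("unexpected lock claim format: %r" % name)
    hostname = ".".join(parts[1:-1])
    pid = int(parts[-1])
    return hostname, pid


def held_by_other_host(content, this_host):
    """True if the claim names a host other than this_host."""
    hostname, _pid = parse_lock_claim(content)
    return hostname != this_host


if __name__ == "__main__":
    # Hypothetical lock content, for illustration only.
    example = "/var/lib/mailman/locks/master-qrunner.node1.example.com.4242\n"
    print(parse_lock_claim(example))
    print(held_by_other_host(example, "node2.example.com"))
```

A lock that names the other node tells you who wrote it, not whether that node is really down, so the decision to clean it up still has to come from your failover logic.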
--
Mark Sapiro <mark@msapiro.net>
San Francisco Bay Area, California
The highway is for gamblers, better use your sense - B. Dylan