[Mailman-Developers] about qrunner and locking
Marc MERLIN
marc_news@valinux.com
Thu, 7 Dec 2000 16:22:34 -0800
Summary since the mail is long:
I'm trying to find out why qrunner needs to lock a list before doing
delivery of a message. I'm getting corruption over NFS under high,
concurrent, load (I'm setting up and testing the new Sourceforge list
servers)
The basic question of this Email is: can I have qrunner ship Emails without
modifying the lists' config.db?
Longer version with explainations and details:
So, I setup that NFS shared mailman tree I was talking about a little while
ago. As a reminder:
/var/local/mailman is NFS exported
/var/local/mailman/qfiles is symlinked to ../mailman.local.
I then applied the two following patches to mailman:
--- mailman-2.0.orig/cron/qrunner Mon Sep 18 14:28:42 2000
+++ mailman-2.0/cron/qrunner Wed Dec 6 14:02:28 2000
@@ -96,7 +96,7 @@
import signal
signal.signal(signal.SIGCHLD, signal.SIG_DFL)
-QRUNNER_LOCK_FILE = os.path.join(mm_cfg.LOCK_DIR, 'qrunner.lock')
+QRUNNER_LOCK_FILE = os.path.join(mm_cfg.QUEUE_DIR, 'qrunner.lock')
LogStdErr('error', 'qrunner', manual_reprime=0, tee_to_stdout=0)
--- mailman-2.0.orig/Mailman/Logging/StampedLogger.py Mon Mar 20 22:25:58 2000
+++ mailman-2.0/Mailman/Logging/StampedLogger.py Wed Dec 6 16:20:03 2000
@@ -16,6 +16,7 @@
import os
import time
+import socket
from Logger import Logger
class StampedLogger(Logger):
@@ -66,7 +67,9 @@
label = "(%d)" % os.getpid()
else:
label = "%s(%d):" % (self.__label, os.getpid())
- prefix = stamp + label
+ hostname = socket.gethostname() + " "
+ prefix = stamp + hostname + label
Logger.write(self, "%s %s" % (prefix, msg))
if msg and msg[-1] == '\n':
self.__bol = 1
The plan here is to have two mailing list servers sharing the same list
configs, but running two different queues to avoid the mailman -> exim
bottleneck (I have 2 machines with 2 CPUs, and I don't want 3 CPUs idle
because a single qrunner is holding a global lock. Sure, it is rather fast
with exim, but since I have two machines (required for failover), I don't
really want one sitting idle, and want to do load balancing too)
To stress test everything, I sent 1000 local messages on each machine at the
same time, and while they were able to get their own qrunner locks, they had
to fight for the list lock (all messages were to the same list, which only
had one user).
Well, it was fast:
15:12:10 Start of injection of 1000 Emails on usw-sf-list1 and usw-sf-list2
15:14:06 1000 Emails accepted and queued on usw-sf-list1
15:14:11 1000 Emails accepted and queued on usw-sf-list2
15:19:07 usw-sf-list1's mailman shipped all the mails
15:19:37 usw-sf-list2's mailman shipped all the mails
Note that usw-sf-list2 was doing this over NFS, and was only marginally
slower considering this.
As expected, there was some corruption in the shared log files (both logs/smtp
and logs/post are missing about 10% of the lines they should have).
Now, what's less fun is this in my error logs:
Dec 07 15:14:39 2000 usw-sf-list2 (17221) test db file was corrupt, using fallback: /var/local/mailman/lists/test/config.db.last
Dec 07 15:14:40 2000 usw-sf-list2 (17285) test db file was corrupt, using fallback: /var/local/mailman/lists/test/config.db.last
Dec 07 15:14:40 2000 usw-sf-list2 (17282) test db file was corrupt, using fallback: /var/local/mailman/lists/test/config.db.last
Dec 07 15:14:40 2000 usw-sf-list2 post(17285): Traceback (innermost last):
usw-sf-list2 post(17285): File "/var/local/mailman/scripts/post", line 94, in ?
usw-sf-list2 post(17285): main()
usw-sf-list2 post(17285): File "/var/local/mailman/scripts/post", line 73, in main
usw-sf-list2 post(17285): mlist = MailList.MailList(listname, lock=0)
usw-sf-list2 post(17285): File "/var/local/mailman/Mailman/MailList.py", line 79, in __init__
Dec 07 15:14:40 2000 usw-sf-list2 (17291) test db file was corrupt, using fallback: /var/local/mailman/lists/test/config.db.last
usw-sf-list2 post(17285): self.Load()
usw-sf-list2 post(17285): File "/var/local/mailman/Mailman/MailList.py", line 908, in Load
usw-sf-list2 post(17285): shutil.copy(lastfile, dbfile)
usw-sf-list2 post(17285): File "/usr/lib/python1.5/shutil.py", line 52, in copy
usw-sf-list2 post(17285): copyfile(src, dst)
usw-sf-list2 post(17285): File "/usr/lib/python1.5/shutil.py", line 18, in copyfile
usw-sf-list2 post(17285): fdst = open(dst, 'wb')
usw-sf-list2 post(17285): IOError : [Errno 116] Stale NFS file handle: '/var/local/mailman/lists/test/config.db'
Dec 07 15:14:41 2000 usw-sf-list2 (17349) test db file was corrupt, using fallback: /var/local/mailman/lists/test/config.db.last
Dec 07 15:14:50 2000 usw-sf-list2 (17905) test db file was corrupt, using fallback: /var/local/mailman/lists/test/config.db.last
Dec 07 15:14:50 2000 usw-sf-list2 (17913) test db file was corrupt, using fallback: /var/local/mailman/lists/test/config.db.last
Dec 07 15:14:50 2000 usw-sf-list2 post(17905): Traceback (innermost last):
usw-sf-list2 post(17905): File "/var/local/mailman/scripts/post", line 94, in ?
usw-sf-list2 post(17905): main()
usw-sf-list2 post(17905): File "/var/local/mailman/scripts/post", line 73, in main
usw-sf-list2 post(17905): mlist = MailList.MailList(listname, lock=0)
usw-sf-list2 post(17905): File "/var/local/mailman/Mailman/MailList.py", line 79, in __init__
usw-sf-list2 post(17905): self.Load()
usw-sf-list2 post(17905): File "/var/local/mailman/Mailman/MailList.py", line 908, in Load
usw-sf-list2 post(17905): shutil.copy(lastfile, dbfile)
usw-sf-list2 post(17905): File "/usr/lib/python1.5/shutil.py", line 52, in copy
usw-sf-list2 post(17905): copyfile(src, dst)
usw-sf-list2 post(17905): File "/usr/lib/python1.5/shutil.py", line 18, in copyfile
usw-sf-list2 post(17905): fdst = open(dst, 'wb')
usw-sf-list2 post(17905): IOError : [Errno 116] Stale NFS file handle: '/var/local/mailman/lists/test/config.db'
Dec 07 15:14:50 2000 usw-sf-list2 (17922) test db file was corrupt, using fallback: /var/local/mailman/lists/test/config.db.last
Dec 07 15:15:44 2000 usw-sf-list2 (21410) test db file was corrupt, using fallback: /var/local/mailman/lists/test/config.db.last
Dec 07 15:15:44 2000 usw-sf-list2 (21419) test db file was corrupt, using fallback: /var/local/mailman/lists/test/config.db.last
and the fact that 3 mails (out of 2000) didn't make it to my mailbox.
I think I can live with the occasional log corruption (I can also lock the
log files before writing to them), but of course, mail loss is not as good.
I've looked at the qrunner code a bit, and I'm trying to understand why it
needs a lock o the list's config.db
I suppose NFS it to blame for this, and somehow, even though both machines
lock the test list, the locking is somehow not NFS safe (I thought it would
be though).
But then comes the question: why does qrunner have to modify the list's
config.db when it ships a message?
I suppose the relevant piece of code in qrunner is:
try:
keepqueued = dispose_message(mlist, msg, msgdata)
# Did the delivery generate child processes? Don't store them in
# the message data files.
kids = msgdata.get('_kids')
if kids:
allkids.update(kids)
del msgdata['_kids']
if not keepqueued:
# We're done with this message
dequeue(root)
but I have to admit to not understanding what it does.
Is there any way to have qrunner send messages without modifying the list
config and thus without having to lock the list either?
Thanks
Marc
--
Microsoft is to operating systems & security ....
.... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | Finger marc_f@merlins.org for PGP key