[Mailman-Developers] about qrunner and locking

Marc MERLIN marc_news@valinux.com
Thu, 7 Dec 2000 16:22:34 -0800


Summary since the mail is long:
I'm  trying to  find out  why  qrunner needs  to  lock a  list before  doing
delivery  of  a  message. I'm  getting   corruption  over  NFS  under  high,
concurrent,  load (I'm  setting  up  and testing  the  new Sourceforge  list
servers)
The basic question of this Email is:  can I have qrunner ship Emails without
modifying the lists' config.db?


Longer version with explainations and details:

So, I setup that NFS shared mailman tree I was talking about a little while
ago. As a reminder:
/var/local/mailman is NFS exported
/var/local/mailman/qfiles is symlinked to ../mailman.local.

I then applied the two following patches to mailman:
--- mailman-2.0.orig/cron/qrunner       Mon Sep 18 14:28:42 2000
+++ mailman-2.0/cron/qrunner    Wed Dec  6 14:02:28 2000
@@ -96,7 +96,7 @@
 import signal
 signal.signal(signal.SIGCHLD, signal.SIG_DFL)
 
-QRUNNER_LOCK_FILE = os.path.join(mm_cfg.LOCK_DIR, 'qrunner.lock')
+QRUNNER_LOCK_FILE = os.path.join(mm_cfg.QUEUE_DIR, 'qrunner.lock')
 
 LogStdErr('error', 'qrunner', manual_reprime=0, tee_to_stdout=0)

--- mailman-2.0.orig/Mailman/Logging/StampedLogger.py   Mon Mar 20 22:25:58 2000
+++ mailman-2.0/Mailman/Logging/StampedLogger.py        Wed Dec  6 16:20:03 2000
@@ -16,6 +16,7 @@
 
 import os
 import time
+import socket
 from Logger import Logger
 
 class StampedLogger(Logger):
@@ -66,7 +67,9 @@
                 label = "(%d)" % os.getpid()
             else:
                 label = "%s(%d):" % (self.__label, os.getpid())
-            prefix = stamp + label
+           hostname = socket.gethostname() + " "
+            prefix = stamp + hostname + label
         Logger.write(self, "%s %s" % (prefix, msg))
         if msg and msg[-1] == '\n':
             self.__bol = 1


The plan  here is  to have two  mailing list servers  sharing the  same list
configs,  but running  two different  queues to  avoid the  mailman ->  exim
bottleneck (I  have 2 machines  with 2  CPUs, and I  don't want 3  CPUs idle
because a single  qrunner is holding a global lock. Sure,  it is rather fast
with exim,  but since I have  two machines (required for  failover), I don't
really want one sitting idle, and want to do load balancing too)

To stress test everything, I sent 1000 local messages on each machine at the
same time, and while they were able to get their own qrunner locks, they had
to fight for the  list lock (all messages were to the  same list, which only
had one user).
Well, it was fast:
15:12:10 Start of injection of 1000 Emails on usw-sf-list1 and usw-sf-list2
15:14:06 1000 Emails accepted and queued on usw-sf-list1 
15:14:11 1000 Emails accepted and queued on usw-sf-list2
15:19:07 usw-sf-list1's mailman shipped all the mails
15:19:37 usw-sf-list2's mailman shipped all the mails

Note that  usw-sf-list2 was  doing this  over NFS,  and was  only marginally
slower considering this.

As expected, there was some corruption in the shared log files (both logs/smtp
and logs/post are missing about 10% of the lines they should have).

Now, what's less fun is this in my error logs:
Dec 07 15:14:39 2000 usw-sf-list2 (17221) test db file was corrupt, using fallback: /var/local/mailman/lists/test/config.db.last
Dec 07 15:14:40 2000 usw-sf-list2 (17285) test db file was corrupt, using fallback: /var/local/mailman/lists/test/config.db.last
Dec 07 15:14:40 2000 usw-sf-list2 (17282) test db file was corrupt, using fallback: /var/local/mailman/lists/test/config.db.last
Dec 07 15:14:40 2000 usw-sf-list2 post(17285): Traceback (innermost last):
usw-sf-list2 post(17285):   File "/var/local/mailman/scripts/post", line 94, in ?
usw-sf-list2 post(17285):      main()
usw-sf-list2 post(17285):   File "/var/local/mailman/scripts/post", line 73, in main
usw-sf-list2 post(17285):      mlist = MailList.MailList(listname, lock=0)
usw-sf-list2 post(17285):   File "/var/local/mailman/Mailman/MailList.py", line 79, in __init__
Dec 07 15:14:40 2000 usw-sf-list2 (17291) test db file was corrupt, using fallback: /var/local/mailman/lists/test/config.db.last
usw-sf-list2 post(17285):      self.Load()
usw-sf-list2 post(17285):   File "/var/local/mailman/Mailman/MailList.py", line 908, in Load
usw-sf-list2 post(17285):      shutil.copy(lastfile, dbfile)
usw-sf-list2 post(17285):   File "/usr/lib/python1.5/shutil.py", line 52, in copy
usw-sf-list2 post(17285):      copyfile(src, dst)
usw-sf-list2 post(17285):   File "/usr/lib/python1.5/shutil.py", line 18, in copyfile
usw-sf-list2 post(17285):      fdst = open(dst, 'wb')
usw-sf-list2 post(17285): IOError :  [Errno 116] Stale NFS file handle: '/var/local/mailman/lists/test/config.db' 
Dec 07 15:14:41 2000 usw-sf-list2 (17349) test db file was corrupt, using fallback: /var/local/mailman/lists/test/config.db.last
Dec 07 15:14:50 2000 usw-sf-list2 (17905) test db file was corrupt, using fallback: /var/local/mailman/lists/test/config.db.last
Dec 07 15:14:50 2000 usw-sf-list2 (17913) test db file was corrupt, using fallback: /var/local/mailman/lists/test/config.db.last
Dec 07 15:14:50 2000 usw-sf-list2 post(17905): Traceback (innermost last):
usw-sf-list2 post(17905):   File "/var/local/mailman/scripts/post", line 94, in ?
usw-sf-list2 post(17905):      main()
usw-sf-list2 post(17905):   File "/var/local/mailman/scripts/post", line 73, in main
usw-sf-list2 post(17905):      mlist = MailList.MailList(listname, lock=0)
usw-sf-list2 post(17905):   File "/var/local/mailman/Mailman/MailList.py", line 79, in __init__
usw-sf-list2 post(17905):      self.Load()
usw-sf-list2 post(17905):   File "/var/local/mailman/Mailman/MailList.py", line 908, in Load
usw-sf-list2 post(17905):      shutil.copy(lastfile, dbfile)
usw-sf-list2 post(17905):   File "/usr/lib/python1.5/shutil.py", line 52, in copy
usw-sf-list2 post(17905):      copyfile(src, dst)
usw-sf-list2 post(17905):   File "/usr/lib/python1.5/shutil.py", line 18, in copyfile
usw-sf-list2 post(17905):      fdst = open(dst, 'wb')
usw-sf-list2 post(17905): IOError :  [Errno 116] Stale NFS file handle: '/var/local/mailman/lists/test/config.db' 
Dec 07 15:14:50 2000 usw-sf-list2 (17922) test db file was corrupt, using fallback: /var/local/mailman/lists/test/config.db.last
Dec 07 15:15:44 2000 usw-sf-list2 (21410) test db file was corrupt, using fallback: /var/local/mailman/lists/test/config.db.last
Dec 07 15:15:44 2000 usw-sf-list2 (21419) test db file was corrupt, using fallback: /var/local/mailman/lists/test/config.db.last

and the fact that 3 mails (out of 2000) didn't make it to my mailbox.


I think I can  live with the occasional log corruption (I  can also lock the
log files before writing to them), but of course, mail loss is not as good.
I've looked at the  qrunner code a bit, and I'm trying  to understand why it
needs a lock o the list's config.db
I suppose NFS it  to blame for this, and somehow,  even though both machines
lock the test list, the locking is  somehow not NFS safe (I thought it would
be though).
But then  comes the  question: why  does qrunner have  to modify  the list's
config.db when it ships a message?
I suppose the relevant piece of code in qrunner is:
        try:
            keepqueued = dispose_message(mlist, msg, msgdata)
            # Did the delivery generate child processes?  Don't store them in
            # the message data files.
            kids = msgdata.get('_kids')
            if kids:
                allkids.update(kids)
                del msgdata['_kids']
            if not keepqueued:
                # We're done with this message
                dequeue(root)


but I have to admit to not understanding what it does.

Is there  any way to have  qrunner send messages without  modifying the list
config and thus without having to lock the list either?

Thanks
Marc
-- 
Microsoft is to operating systems & security ....
                                      .... what McDonalds is to gourmet cooking
  
Home page: http://marc.merlins.org/   |   Finger marc_f@merlins.org for PGP key