I'm not sure if this is appropriate here, so apologies in advance if it isn't... We're using Mailman 2.0.8 on a sun ultra with solaris 8, and we were running into two problems with qrunner... The first was a race condition where this code in qrunner - # If the .db file has no corresponding .msg file, we might as well - # remove the .db file, since there's little we can do about it. - if not os.path.exists(root+'.msg'): - syslog('qrunner', 'Unlinking orphaned .db file: %s.db' % root) - os.unlink(root+'.db') was unlinking .db files before their corresponding .msg files were created. We just removed those lines and it seems to be ok -- no orphaned .db files and no more orphaned .msg files. The second problem was that qrunner was crashing on some kinds of bounce messages (?), causing messages to pile up in the queue occasionally. It was generating errors like this:
Traceback (most recent call last): File "/usr/local/mailman/cron/qrunner", line 277, in ? kids = main(lock) File "/usr/local/mailman/cron/qrunner", line 247, in main keepqueued = dispose_message(mlist, msg, msgdata) File "/usr/local/mailman/cron/qrunner", line 121, in dispose_message if BouncerAPI.ScanMessages(mlist, mimemsg): File "/usr/local/mailman/Mailman/Bouncers/BouncerAPI.py", line 59, in ScanMessages addrs = func(msg) File "/usr/local/mailman/Mailman/Bouncers/DSN.py", line 46, in process if string.lower(msg.gettype()) <> 'multipart/report' or \ File "/usr/local/lib/python2.1/string.py", line 51, in lower return s.lower() AttributeError : 'None' object has no attribute 'lower'
The patch below seems to fix the problem (?)... we're now seeing some errors like this: bounce loop detected from: mailer-daemon@web.net but at least qrunner keeps running and the queue is getting cleared. - Rob -- Rob Ellis <rob@web.ca> System Administrator, Web Networks * --- Mailman/Bouncers/DSN.py 14 Nov 2001 15:57:09 -0000 1.1.1.2 +++ Mailman/Bouncers/DSN.py 7 Mar 2002 16:29:02 -0000 1.2 @@ -43,7 +43,7 @@ def process(msg): - ctype = msg.gettype() + ctype = msg.gettype() or '' param = msg.getparam('report-type') or '' if string.lower(ctype) <> 'multipart/report' or \ string.lower(param) <> 'delivery-status':
"RE" == Rob Ellis <rob@web.ca> writes:
RE> I'm not sure if this is appropriate here, so apologies in RE> advance if it isn't... Yup, this is the right place. RE> We're using Mailman 2.0.8 on a sun ultra with solaris 8, RE> and we were running into two problems with qrunner... RE> The first was a race condition where this code in qrunner RE> - # If the .db file has no corresponding .msg file, we might RE> as well - # remove the .db file, since there's little we can RE> do about it. - if not os.path.exists(root+'.msg'): - RE> syslog('qrunner', 'Unlinking orphaned .db file: %s.db' % root) RE> - os.unlink(root+'.db') RE> was unlinking .db files before their corresponding .msg files RE> were created. We just removed those lines and it seems to be RE> ok -- no orphaned .db files and no more orphaned .msg files. I don't think this is quite the right fix. In MM2.0.x, qrunner keys off the .msg file, and Message.Enqueue() is careful to write the .db file first and then the .msg file. You're right that there's a race condition though, because the test in qrunner's main() is backwards! Since the .db is written first, followed by the .msg, you really want to look for orphaned .msg files not .db files. This is because .db files may legitimately not have a corresponding .msg file if the enqueing processes hasn't completed yet. See the attached patch for a better solution, which also includes some paranoia guards in Message.Enqueue() so that it atomically makes the .msg files available. Note: this patch is untested, and I don't think I'll get to testing it today. I will likely test it next week on python.org (yikes!) and if it looks good, I will issue a MM2.0.9 with this fix (and a few others). I'd appreciate it if someone else could verify this works and doesn't break anything else. RE> The second problem was that qrunner was crashing on some kinds RE> of bounce messages (?), causing messages to pile up in the RE> queue occasionally. It was generating errors like this: RE> The patch below seems to fix the problem (?)... we're now RE> seeing some errors like this: Good patch. This can happen if a message doesn't have a Content-Type: header at all. This will be in MM2.0.9 as well. -Barry -------------------- snip snip -------------------- Index: Mailman/Message.py =================================================================== RCS file: /cvsroot/mailman/mailman/Mailman/Message.py,v retrieving revision 1.40 diff -u -r1.40 Message.py --- Mailman/Message.py 2 Aug 2000 02:00:13 -0000 1.40 +++ Mailman/Message.py 29 Mar 2002 19:00:57 -0000 @@ -177,11 +177,14 @@ marshal.dump(msgdata, dbfp) dbfp.close() # If it doesn't already exist, or if the text of the message has - # changed, write the message file to disk. + # changed, write the message file to disk, but do it atomically so as + # not to hit a race condition with the qrunner. if not existsp or dirty: - msgfp = Utils.open_ex(msgfile, 'w') + tmpfile = msgfile + '.tmp' + msgfp = Utils.open_ex(tmpfile, 'w') msgfp.write(text) msgfp.close() + os.rename(tmpfile, msgfile) Index: cron/qrunner =================================================================== RCS file: /cvsroot/mailman/mailman/cron/Attic/qrunner,v retrieving revision 1.18.2.3 diff -u -r1.18.2.3 qrunner --- cron/qrunner 18 Apr 2001 03:58:35 -0000 1.18.2.3 +++ cron/qrunner 29 Mar 2002 19:00:58 -0000 @@ -195,13 +195,11 @@ lock.refresh() root, ext = os.path.splitext(os.path.join(mm_cfg.QUEUE_DIR, file)) if ext == '.db': - # If the .db file has no corresponding .msg file, we might as well - # remove the .db file, since there's little we can do about it. - if not os.path.exists(root+'.msg'): - syslog('qrunner', 'Unlinking orphaned .db file: %s.db' % root) - os.unlink(root+'.db') - # just trigger off the .msg files + # Just trigger off the .msg files continue + if not os.path.exists(root+'.db'): + syslog('qrunner', 'Unlinking orphaned .msg file: %s.msg' % root) + os.unlink(root+'.msg') # Have we exceeded either resource limit? if mm_cfg.QRUNNER_PROCESS_LIFETIME is not None and \ (time.time() - t0) > mm_cfg.QRUNNER_PROCESS_LIFETIME:
participants (2)
-
barry@zope.com
-
Rob Ellis