I've just identified a pretty bad bug in Mailman 2.0.x qrunner. It can cause messages to get lost, so, I almost hate to say this, Barry, but it might be time for a quick 2.0.9 patch. Given the changes to queueing in 2.1, I think this bug isn't relevant to the 2.1 tree.

If you're hit by the bug, you'll see occasional reports in the error log like:

    Apr 02 06:48:03 2002 qrunner(12168): Traceback (most recent call last):
    Apr 02 06:48:03 2002 qrunner(12168):   File "/export/home/mailman/cron/qrunner", line 282, in ?
    Apr 02 06:48:03 2002 qrunner(12168):     kids = main(lock)
    Apr 02 06:48:03 2002 qrunner(12168):   File "/export/home/mailman/cron/qrunner", line 202, in main
    Apr 02 06:48:03 2002 qrunner(12168):     os.unlink(root+'.db')
    Apr 02 06:48:03 2002 qrunner(12168): OSError: [Errno 2] No such file or directory: '/export/home/mailman/qfiles/74a651f8eba5fce7ca800968bcd105b0bddb0c96.db'
    Apr 02 06:48:24 2002 admin(12193): @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

Here's the failure scenario. It can happen on any Mailman system, because it's a timing hole, but a busy system will be more susceptible because of the number of files being created in qfiles.

Qrunner, as it processes, opens the qfiles directory inode and walks it until it runs out of files (or times out) and exits. It starts at the top and simply goes to the end. That's why 2.0.x isn't FIFO through the queue: processing order is based on location within the directory inode.

The problem is when "wrapper post" is writing into qfiles at the same time qrunner is processing and deleting. If qrunner hits a message AS IT IS BEING WRITTEN by "wrapper post", bad things can start happening. Scripts/post writes the .db file, then the message file. If it has written the .db file into qfiles and qrunner then tries to process that entry, qrunner will see no .msg, declare the entry orphaned, and delete the .db file. Post then writes the .msg file and closes the .db file, and since qrunner unlinked it, the .db file is gone for good. You're left with an orphaned .msg file.

Unfortunately, qrunner and post don't lock each other out of files, and sort of by definition, you can't have one lock the other out of the directory during processing. But that leaves this tiny window where they can get really confused, and bad things happen. You end up with messages that are accepted by post and then disappear forever. I'm seeing this about once a day on my big server now, simply because of the message volumes.

There are a couple of ways to fix this in the source. The "right" way would be to put file locking into post and qrunner, so qrunner can tell that post has the file open and skip it. Given that the queueing is being completely redone in 2.1, I'd suggest a second fix: in cron/qrunner, around line 200, where the check for the missing file is, put in a quick check of the creation time of the file. If it's less than 5 minutes old, simply leave it alone and don't unlink it. Because qrunner will check again next trip through, it'll still unlink real orphans, but it protects you from this tiny window of oopsie. Rough sketches of both approaches follow.
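For illustration, the locking version might look something like this. This is NOT the actual Mailman source -- the function names and structure here are made up, and it only shows the idea: post holds an flock() on the .db file for the whole write, and qrunner probes with a non-blocking flock() and skips the entry if post still has it.

    import fcntl
    import os

    QFILES = '/export/home/mailman/qfiles'    # queue directory from the log above

    def post_write(root, dbdata, msgdata):
        # scripts/post side (paraphrased): take an exclusive lock on the
        # .db file and hold it until both queue files are fully written.
        db = open(os.path.join(QFILES, root + '.db'), 'w')
        fcntl.flock(db.fileno(), fcntl.LOCK_EX)
        db.write(dbdata)
        msg = open(os.path.join(QFILES, root + '.msg'), 'w')
        msg.write(msgdata)
        msg.close()
        db.close()              # closing the file drops the lock

    def qrunner_open(root):
        # qrunner side: if the non-blocking lock fails, post is mid-write,
        # so skip the entry this pass instead of treating it as an orphan.
        db = open(os.path.join(QFILES, root + '.db'))
        try:
            fcntl.flock(db.fileno(), fcntl.LOCK_EX | fcntl.LOCK_NB)
        except IOError:
            db.close()
            return None         # post still has it; try again next run
        return db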
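And here's a rough sketch of the band-aid version for cron/qrunner. It's paraphrased from what the code around line 200 does, not copied from the source; MIN_AGE and the function name are mine. Note that plain Unix has no real "creation time", so the file's mtime stands in for it.

    import os
    import time

    MIN_AGE = 5 * 60    # seconds; anything younger may still be mid-write

    def unlink_if_real_orphan(root):
        # Only reap a .db file with no matching .msg if it's old enough
        # that post can't plausibly still be writing the pair.
        dbfile = root + '.db'
        try:
            age = time.time() - os.path.getmtime(dbfile)
        except OSError:
            return              # already gone; nothing to do
        if age > MIN_AGE:
            os.unlink(dbfile)
        # else: leave it alone; the next qrunner pass will re-check it
        # and reap it then if it turns out to be a genuine orphan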
Normally, I'd say "on to 2.1", but since this is a fairly serious "silent data loss" bug AND the fix is trivial, I think it might make sense to patch this and roll 2.0.9. At the least, I think a patch needs to be approved by Barry, released, and made visible on the lists.org website.

Please don't ask me how I found this. I'd have to kill you. But this is one of the more obscure bugs I've ever found... (grin) The window of opportunity to trigger it is immensely small. You need two programs to be simultaneously updating the same directory inode, and processing the same slot IN that inode, at the exact same time. We're talking about a latency of, as far as I can tell, 5-15 milliseconds every time post writes a message into qfiles, but only if qrunner is actively processing. It looks like Barry took care (from reading the source) to avoid this kind of situation -- but didn't quite lock the window closed.

--
Chuq Von Rospach, Architech
chuqui@plaidworks.com -- http://www.chuqui.com/

The first rule of holes: If you are in one, stop digging.