[Mailman-Developers] 2.0.x qrunner bug (bad one)

Chuq Von Rospach chuqui@plaidworks.com
Tue, 02 Apr 2002 13:39:06 -0800


I've just identified a pretty bad bug in mailman 2.0.x qrunner. It can cause
messages to get lost, so, I almost hate to say this, Barry, but it might be
time for a quick 2.0.9 patch. Given the changes to queuing in 2.1, I think
this bug isn't relevant to the 2.1 tree.

If you're hit by the bug, you'll see occasional reports in the error log
like:

Apr 02 06:48:03 2002 qrunner(12168): Traceback (most recent call last):
Apr 02 06:48:03 2002 qrunner(12168):   File
"/export/home/mailman/cron/qrunner",
 line 282, in ?
Apr 02 06:48:03 2002 qrunner(12168):      kids = main(lock)
Apr 02 06:48:03 2002 qrunner(12168):   File
"/export/home/mailman/cron/qrunner",
 line 202, in main
Apr 02 06:48:03 2002 qrunner(12168):      os.unlink(root+'.db')
Apr 02 06:48:03 2002 qrunner(12168): OSError :  [Errno 2] No such file or
direct
ory: 
'/export/home/mailman/qfiles/74a651f8eba5fce7ca800968bcd105b0bddb0c96.db'
Apr 02 06:48:24 2002 admin(12193):
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

Here's the failure scenario. It can happen to any mailman system, because
it's a timing hole, but a busy system will be more susceptible because of
the number of files being created in qfiles.

Qrunner, as it processes, opens the qfiles directory inode and walks it
until it runs out of files (or times out) and exits. It starts at the top
and simply goes to the end. That's why 2.0.x isn't FIFO through the queue,
processing is based on location  within the directory inode.

The problem is when "wrapper post" is writing into qfile at the same time
qrunner is processing and deleting. If qrunner hits a message AS IT IS BEING
WRITTEN by "wrapper post", bad things can start happening.

Scripts/post writes the .db file, then the message file. IF it happens to
write the .db file into qfiles, and then qrunner tries to process it,
qrunner will see no .msg, declare it orphaned, and delete it. Post then
writes the .msg file, closes the .db file, and since qrunner unlinked it,
the .db file then gets removed. You then get left with an orphaned .msg
file. 

Unfortunately, qrunner and post don't lock each other off files, and sort of
by definition, you can't have one lock the other out of the directory during
processing. But that leaves this tiny window where they can get really
confused, and bad things happen. You end up with messages that are accepted
by post, and then disappear forever.

I'm seeing this about once a day on my big server now, simply because of the
message volumes. 

There are a couple of ways to fix this in the source. The "right" way would
be to put file locking into post and qrunner, so qrunner can tell that post
has the file open and skip it. Given that the queueing is being completely
redone in 2.1, I'd suggest a second fix. In cron/qrunner, about 200 where
the check for the missing file is, put in a quick check of the creation time
for the file. If it's less than 5 minutes old, simply leave it alone and
don't unlink it. Because it'll go and check again next trip through, it'll
unlink real orphans, but it protects you from this tiny window of oopsie.

Normally, I'd say "on to 2.1", but since this is a fairly serious "silent
data loss" bug AND the fix is trivial, I think it might make sense to patch
this and roll 2.0.9. At the least, I think a patch needs to be approved by
Barry and released and made visible on the lists.org website.

Please don't ask me how I found this. I'd have to kill you. But this is one
of the more obscure bugs I've ever found... (grin) The window of opportunity
to trigger it is immensely small. You need two programs to be simultaneously
updating the same directory inode, and processing the same slot IN that
inode, at the exact same time. We're talking about a latency of, as far as I
can tell, 5-15 milliseconds every time post writes a message into qfile, but
only if qrunner is actively processing. It looks like Barry took care (from
reading the source) to avoid this kind of situation -- but didn't quite lock
the window closed.





-- 
Chuq Von Rospach, Architech
chuqui@plaidworks.com -- http://www.chuqui.com/

The first rule of holes: If you are in one, stop digging.