"CVR" == Chuq Von Rospach <chuqui@plaidworks.com> writes:
CVR> I've just identified a pretty bad bug in mailman 2.0.x
CVR> qrunner. It can cause messages to get lost, so, I almost hate
CVR> to say this, Barry, but it might be time for a quick 2.0.9
CVR> patch. Given the changes to queuing in 2.1, I think this bug
CVR> isn't relevant to the 2.1 tree.
You know, I think I just fixed this on Friday, although I didn't get a chance to check everything into cvs before the holiday weekend. I definitely didn't get a chance to test it.
It is a valid bug, and it does warrant a 2.0.9 patch. The basic bug is caused by a disagreement on the order of .db and .msg file writing between qrunner and Message.Enqueue(), as Chuq rightly observes. I think the fix is simpler than what Chuq outlines, though.
Message.Enqueue() breaks the race by writing the .db file before it writes the .msg file, but qrunner's logic is backwards! It ignores the .msg files but it should be ignoring the .db files, since they're written first.
The fix is to qrunner, which should ignore .db files, triggering only on .msg files. If it finds a .msg file without a corresponding .db file, then it should unlink the orphaned .msg file. The final piece of the puzzle is that Message.Enqueue() should write the .msg file atomically, meaning, write it to a tmp file and use rename() to move it into place atomically.
CVR> Normally, I'd say "on to 2.1", but since this is a fairly
CVR> serious "silent data loss" bug AND the fix is trivial, I
CVR> think it might make sense to patch this and roll 2.0.9. At
CVR> the least, I think a patch needs to be approved by Barry and
CVR> released and made visible on the lists.org website.
Everything's checked into cvs now, and I'm about to do some testing. The more eyeballs on this code, the better, since it is so integral to the proper operation of the system. After some off-line stress testing, I'll foist this patch on python.org, watch the logs for a day or two, and then do the 2.0.9 release.
CVR> Please don't ask me how I found this. I'd have to kill
CVR> you. But this is one of the more obscure bugs I've ever
CVR> found... (grin)
Indeed. While I've seen the occasional reports of this for a while, it's nearly impossible to reproduce, and even with the traffic we see on python.org/zope.org, I've /never/ seen it there.
CVR> The window of opportunity to trigger it is immensely
CVR> small. You need two programs to be simultaneously updating
CVR> the same directory inode, and processing the same slot IN
CVR> that inode, at the exact same time. We're talking about a
CVR> latency of, as far as I can tell, 5-15 milliseconds every
CVR> time post writes a message into qfile, but only if qrunner is
CVR> actively processing. It looks like Barry took care (from
CVR> reading the source) to avoid this kind of situation -- but
CVR> didn't quite lock the window closed.
Unless I still haven't recovered from Maryland's glorious and long-awaited victory last night[*], I'm surprised this one snuck past us for so long. It jumped right out at me when I reviewed the code again. Sigh.
Fearing-the-turtle-ly y'rs, -Barry
[*] NCAA (college) men's basketball national champs.