I've just identified a pretty bad bug in mailman 2.0.x qrunner. It can cause messages to get lost, so, I almost hate to say this, Barry, but it might be time for a quick 2.0.9 patch. Given the changes to queuing in 2.1, I think this bug isn't relevant to the 2.1 tree.
If you're hit by the bug, you'll see occasional reports in the error log like:
Apr 02 06:48:03 2002 qrunner(12168): Traceback (most recent call last):
Apr 02 06:48:03 2002 qrunner(12168): File "/export/home/mailman/cron/qrunner", line 282, in ?
Apr 02 06:48:03 2002 qrunner(12168): kids = main(lock)
Apr 02 06:48:03 2002 qrunner(12168): File "/export/home/mailman/cron/qrunner", line 202, in main
Apr 02 06:48:03 2002 qrunner(12168): os.unlink(root+'.db')
Apr 02 06:48:03 2002 qrunner(12168): OSError : [Errno 2] No such file or directory: '/export/home/mailman/qfiles/74a651f8eba5fce7ca800968bcd105b0bddb0c96.db'
Apr 02 06:48:24 2002 admin(12193): @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
Here's the failure scenario. It can happen to any mailman system, because it's a timing hole, but a busy system will be more susceptible because of the number of files being created in qfiles.
Qrunner, as it processes, opens the qfiles directory inode and walks it until it runs out of files (or times out) and exits. It starts at the top and simply goes to the end. That's why 2.0.x isn't FIFO through the queue: processing is based on location within the directory inode.
The problem is when "wrapper post" is writing into qfile at the same time qrunner is processing and deleting. If qrunner hits a message AS IT IS BEING WRITTEN by "wrapper post", bad things can start happening.
Scripts/post writes the .db file, then the message file. IF it happens to write the .db file into qfiles, and then qrunner tries to process it, qrunner will see no .msg, declare it orphaned, and delete it. Post then writes the .msg file, closes the .db file, and since qrunner unlinked it, the .db file then gets removed. You then get left with an orphaned .msg file.
Unfortunately, qrunner and post don't lock each other off files, and sort of by definition, you can't have one lock the other out of the directory during processing. But that leaves this tiny window where they can get really confused, and bad things happen. You end up with messages that are accepted by post, and then disappear forever. I'm seeing this about once a day on my big server now, simply because of the message volumes.
There are a couple of ways to fix this in the source. The "right" way would be to put file locking into post and qrunner, so qrunner can tell that post has the file open and skip it. Given that the queueing is being completely redone in 2.1, I'd suggest a second fix. In cron/qrunner, around line 200, where the check for the missing file is, put in a quick check of the creation time for the file. If it's less than 5 minutes old, simply leave it alone and don't unlink it. Because it'll go and check again next trip through, it'll unlink real orphans, but it protects you from this tiny window of oopsie.
Normally, I'd say "on to 2.1", but since this is a fairly serious "silent data loss" bug AND the fix is trivial, I think it might make sense to patch this and roll 2.0.9. At the least, I think a patch needs to be approved by Barry and released and made visible on the lists.org website.
Please don't ask me how I found this. I'd have to kill you. But this is one of the more obscure bugs I've ever found... (grin)
The window of opportunity to trigger it is immensely small. You need two programs to be simultaneously updating the same directory inode, and processing the same slot IN that inode, at the exact same time. We're talking about a latency of, as far as I can tell, 5-15 milliseconds every time post writes a message into qfile, but only if qrunner is actively processing. It looks like Barry took care (from reading the source) to avoid this kind of situation -- but didn't quite lock the window closed.
-- Chuq Von Rospach, Architech chuqui@plaidworks.com -- http://www.chuqui.com/ The first rule of holes: If you are in one, stop digging.
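The age check suggested above could look roughly like the sketch below. This is not the actual cron/qrunner code: it is written for a current Python rather than the Python that Mailman 2.0.x ran on, the helper name and the constant are hypothetical, and it uses the file's modification time as a stand-in for "creation time", since classic Unix filesystems do not record a true creation time.

```python
import os
import time

# Hypothetical grace period, taken from the "5 minutes" suggested in the message.
ORPHAN_GRACE_SECONDS = 5 * 60


def is_safe_to_unlink(path, now=None, grace=ORPHAN_GRACE_SECONDS):
    """Return True only if `path` exists and is older than the grace period.

    Assumption: st_mtime is used as an approximation of creation time.
    """
    if now is None:
        now = time.time()
    try:
        mtime = os.stat(path).st_mtime
    except FileNotFoundError:
        # Already gone (for example, removed by another process): nothing to do.
        return False
    return (now - mtime) >= grace
```

A caller would unlink the suspected orphan only when this returns True; a file that is too young is simply left for the next pass, which is what keeps real orphans from piling up.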
Chuq Von Rospach wrote:
I've just identified a pretty bad bug in mailman 2.0.x qrunner. It can
Just wanted to confirm this bug. My server is not exactly high-volume, but I now discovered I have gotten this error ONCE on ONE list since August 2001. :) January 13th, the following happened:
logs/error:
Jan 13 23:46:02 2002 qrunner(23874): Traceback (most recent call last):
Jan 13 23:46:02 2002 qrunner(23874): File "/home/mailman/cron/qrunner", line 282, in ?
Jan 13 23:46:02 2002 qrunner(23874): kids = main(lock)
Jan 13 23:46:02 2002 qrunner(23874): File "/home/mailman/cron/qrunner", line 202, in main
Jan 13 23:46:03 2002 qrunner(23874): os.unlink(root+'.db')
Jan 13 23:46:03 2002 qrunner(23874): OSError : [Errno 2] No such file or directory: '/home/mailman/qfiles/9fbaa916ea7f515e789a4928559e872f65b54d1f.db'
logs/qrunner:
Jan 13 23:46:02 2002 (23874) Unlinking orphaned .db file: /home/mailman/qfiles/9fbaa916ea7f515e789a4928559e872f65b54d1f.db
logs/smtp:
smtp:Jan 13 23:46:01 2002 (23874) smtp for 1 recips, completed in 0.724 seconds
smtp:Jan 13 23:46:02 2002 (23874) smtp for 1 recips, completed in 0.513 seconds
smtp:Jan 13 23:46:02 2002 (23874) smtp for 1 recips, completed in 0.709 seconds
Daniel, the norwegian dude. :D
On Wed, Apr 03, 2002 at 12:01:52AM +0200, Daniel Buchmann wrote:
Chuq Von Rospach wrote:
I've just identified a pretty bad bug in mailman 2.0.x qrunner. It can
Just wanted to confirm this bug. My server is not exactly high-volume, but I now discovered I have gotten this error ONCE on ONE list since August 2001. :)
Since Dec. 5th, on my machine (about 500-900meg of outgoing mail/day):
bash-2.03$ grep os.unlink error |wc -l
82
Solaris 8/mailman 2.0.8/Postfix here.
Bill
-- Bill Bradford mrbill@mrbill.net Austin, TX
Chuq Von Rospach <chuqui@plaidworks.com> wrote:
The problem is when "wrapper post" is writing into qfile at the same time qrunner is processing and deleting. [..] Normally, I'd say "on to 2.1", but since this is a fairly serious "silent data loss" bug AND the fix is trivial, I think it might make sense to patch this and roll 2.0.9.
I think that like security bugs, potential data loss bugs also justify a bugfix maintenance release.
At the same time, if 2.0.9 is rolled out anyway, I'd suggest to put in that little patch to insert a Date: header. (You see the missing Date: header bug only if you run a MTA like Qmail which doesn't add missing Date: headers on its own.)
Greetings, Norbert.
-- A founder of the http://DotGNU.org project and Steering Committee member Norbert Bollow, Weidlistr.18, CH-8624 Gruet (near Zurich, Switzerland) Tel +41 1 972 20 59 Fax +41 1 972 20 69 http://thinkcoach.com List hosting with GNU Mailman on your own domain name http://cisto.com
On 4/2/02 2:05 PM, "Norbert Bollow" <nb@thinkcoach.com> wrote:
At the same time, if 2.0.9 is rolled out anyway, I'd suggest to put in that little patch to insert a Date: header.
The other thing I've noticed about qrunner is it doesn't reap its children reliably (or at all, I'm not sure). So if you have a busy mailman system, it tends to collect zombie processes. This could create resource issues for some sites that are living a bit on the edge. And once qrunner exits, they all magically disappear... If we're rolling changes into 2.0.9, here's another one that ought to be considered, although again, it's mostly a big/busy system issue.
-- Chuq Von Rospach, Architech chuqui@plaidworks.com -- http://www.chuqui.com/
IMHO: Jargon. Acronym for In My Humble Opinion. Used to flag as an opinion something that is clearly from context an opinion to everyone except the mentally dense. Opinions flagged by IMHO are actually rarely humble. IMHO. (source: third unabridged dictionary of chuqui-isms).
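As general background on the zombie remark above: a child process that has exited stays in the process table until its parent waits on it. The sketch below shows a non-blocking reap loop in current Python on a POSIX system. It is not qrunner code, and the function name is hypothetical.

```python
import os


def reap_children():
    """Collect every child that has already exited, without blocking.

    Returns the list of reaped process ids.
    """
    reaped = []
    while True:
        try:
            # -1 means "any child"; WNOHANG returns immediately if none has exited.
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            # This process has no children at all.
            break
        if pid == 0:
            # Children exist, but none of them has exited yet.
            break
        reaped.append(pid)
    return reaped
```

A long-running parent would call something like this periodically, or from a SIGCHLD handler, so that exited children do not accumulate.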
"CVR" == Chuq Von Rospach <chuqui@plaidworks.com> writes:
CVR> The other thing I've noticed about qrunner is it doesn't reap
CVR> its children reliably (or at all, I'm not sure). So if you
CVR> have a busy mailman system, it tends to collect zombie
CVR> processes. This could create resource issues for some sites
CVR> that are living a bit on the edge. And once qrunner exits,
CVR> they all magically disappear... If we're rolling changes into
CVR> 2.0.9, here's another one that ought to be considered,
CVR> although again, it's mostly a big/busy system issue.
But there shouldn't /be/ any children. The only code that calls fork() in 2.0.x is the mail->news gateway (ignore the test code in LockFile.py). Are you running any gated lists Chuq?
-Barry
"NB" == Norbert Bollow <nb@thinkcoach.com> writes:
NB> At the same time, if 2.0.9 is rolled out anyway, I'd suggest
NB> to put in that little patch to insert a Date: header. (You
NB> see the missing Date: header bug only if you run a MTA like
NB> Qmail which doesn't add missing Date: headers on its own.)
Hmm, can't Qmail be taught to add the Date: header if it's missing, like every other MTA I'm aware of?
-Barry
"BAW" == Barry A Warsaw <barry@zope.com> writes:
"NB" == Norbert Bollow <nb@thinkcoach.com> writes:
NB> At the same time, if 2.0.9 is rolled out anyway, I'd suggest
NB> to put in that little patch to insert a Date: header. (You
NB> see the missing Date: header bug only if you run a MTA like
NB> Qmail which doesn't add missing Date: headers on its own.)
BAW> Hmm, can't Qmail be taught to add the Date: header if it's
BAW> missing, like every other MTA I'm aware of?
When you call qmail, if you want to add the date header, call the "predate" program with the path to qmail's sendmail as an argument:
predate /path/to/sendmail (arguments)
Ben
-- Brought to you by the letters S and C and the number 2. "I don't want the world.. I just want your half." Debian GNU/Linux maintainer of Gimp and Nethack -- http://www.debian.org/
Ben Gertzfield <che@debian.org> wrote:
"BAW" == Barry A Warsaw <barry@zope.com> writes:
[..]
BAW> Hmm, can't Qmail be taught to add the Date: header if it's
BAW> missing, like every other MTA I'm aware of?
I think that DJB's "no message munging" approach is the correct one.
(There's nothing wrong with some kinds of munging if you know what you're doing and make sure that this munging happens only to messages that originate from your system. But "teaching" an MTA to munge messages in general, without such a restriction, is a bad idea IMO.)
When you call qmail, if you want to add the date header, call the "predate" program with the path to qmail's sendmail as an argument:
predate /path/to/sendmail (arguments)
This approach is not available when we pass the message to Qmail by talking SMTP, like SMTPDirect.py does.
Let's just apply that tiny little patch and give the MTA an RFC-compliant message, and consider the issue closed.
Greetings, Norbert.
-- A founder of the http://DotGNU.org project and Steering Committee member Norbert Bollow, Weidlistr.18, CH-8624 Gruet (near Zurich, Switzerland) Tel +41 1 972 20 59 Fax +41 1 972 20 69 http://thinkcoach.com List hosting with GNU Mailman on your own domain name http://cisto.com
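A minimal sketch of what "insert a Date: header if it is missing" amounts to, written against the email package of a current Python. This is not the patch discussed in the thread; the function name is hypothetical, and the choice of local time with a numeric offset is an assumption.

```python
from email.message import Message
from email.utils import formatdate


def ensure_date_header(msg):
    """Add an RFC 2822 style Date: header to `msg` only if it has none."""
    if msg["Date"] is None:
        # localtime=True yields the local time with its numeric UTC offset.
        msg["Date"] = formatdate(localtime=True)
    return msg
```

Because the header is only added when absent, a message that arrives with its own Date: passes through untouched, which keeps the change limited to messages that would otherwise reach the MTA without one.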
On Wednesday 03 April 2002 05:35 am, you wrote:
Ben Gertzfield <che@debian.org> wrote:
> "BAW" == Barry A Warsaw <barry@zope.com> writes:
[..]
BAW> Hmm, can't Qmail be taught to add the Date: header if it's
BAW> missing, like every other MTA I'm aware of?
I think that DJB's "no message munging" approach is the correct one.
(There's nothing wrong with some kinds of munging if you know what you're doing and make sure that this munging happens only to messages that originate from your system. But "teaching" an MTA to munge messages in general, without such a restriction, is a bad idea IMO.)
When you call qmail, if you want to add the date header, call the "predate" program with the path to qmail's sendmail as an argument:
predate /path/to/sendmail (arguments)
This approach is not available when we pass the message to Qmail by talking SMTP, like SMTPDirect.py does.
Let's just apply that tiny little patch and give the MTA an RFC-compliant message, and consider the issue closed.
This was discussed a couple of months ago on this list and it was decided that it would be ok to put a date in to make it rfc compliant.
Norbert Bollow <nb@thinkcoach.com> writes:
This approach is not available when we pass the message to Qmail by talking SMTP, like SMTPDirect.py does.
Let's just apply that tiny little patch and give the MTA an RFC-compliant message, and consider the issue closed.
Then shouldn't you also have Mailman add a Message-ID header as well? SMTP on qmail doesn't add that either.
Of course, both of these headers are now added in Mailman 2.1. Perhaps just wait for that to be released instead?
"JRM" == Jason R Mastaler <jason-list-mailman-developers@mastaler.com> writes:
JRM> Then shouldn't you also have Mailman add a Message-ID header
JRM> as well? SMTP on qmail doesn't add that either.
I thought about the same thing, and /almost/ added it, but then I checked RFC 2822. While it requires one Date: header, Message-ID: is actually /not/ required, although it "SHOULD be present". I took that as license to be as conservative as possible. :)
-Barry
On 4/3/02 6:39 PM, "Barry A. Warsaw" <barry@zope.com> wrote:
I thought about the same thing, and /almost/ added it, but then I checked RFC 2822. While it requires one Date: header, Message-ID: is actually /not/ required, although it "SHOULD be present". I took that as license to be as conservative as possible. :)
I'm curious, and in honesty, haven't looked. Is qmail RFC compliant here? Or is it a case of doing what the authors feel is right, not what the standards tell them to?
-- Chuq Von Rospach, Architech chuqui@plaidworks.com -- http://www.chuqui.com/
No! No! Dead girl, OFF the table! -- Shrek
"CVR" == Chuq Von Rospach <chuqui@plaidworks.com> writes:
>> I thought about the same thing, and /almost/ added it, but then
>> I checked RFC 2822. While it requires one Date: header,
>> Message-ID: is actually /not/ required, although it "SHOULD be
>> present". I took that as license to be as conservative as
>> possible. :)
CVR> I'm curious, and in honesty, haven't looked. Is qmail RFC
CVR> compliant here? Or is it a case of doing what the authors
CVR> feel is right, not what the standards tell them to?
I think Qmail is probably within its rights to reject a message without a From: field or a Date: field, as RFC 2822, section 3.6.1 says:
3.6.1. The origination date field
[...] In any case, it is specifically not intended to convey the time that the message is actually transported, but rather the time at which the human or other creator of the message has put the message into its final form, ready for transport.
It's probably RFC 2821, though, that specifies what the SMTP server is allowed to do in the face of invalid message content. Scanning both RFCs, I really can't find the connection, but it's probably valid for Qmail to return a 50x error after the DATA command. (I tested the one Qmail server I know about, starship.python.net, and it didn't seem to complain.)
Personally, though, I think it's thickheaded for Qmail to reject messages from localhost that are missing any headers it's perfectly capable of supplying. Every other MTA does it, so I think Qmail's just being obstinate.
I'd also appreciate if any RFC-lawyers can point to specific text to back up Qmail's opinion.
-Barry
barry@zope.com (Barry A. Warsaw) writes:
I think Qmail is probably within its rights to reject a message without a From: field or a Date: field, as RFC 2822, section 3.6.1
qmail doesn't reject such messages, it simply passes them on without the headers. This means the receiving MTA usually ends up adding them which might be confusing for the recipient.
"JRM" == Jason R Mastaler <jason-list-mailman-developers@mastaler.com> writes:
>> I think Qmail is probably within its rights to reject a message
>> without a From: field or a Date: field, as RFC 2822, section
>> 3.6.1
JRM> qmail doesn't reject such messages, it simply passes them on
JRM> without the headers. This means the receiving MTA usually
JRM> ends up adding them which might be confusing for the
JRM> recipient.
But isn't that different? Doesn't this mean Qmail is violating the standards too?
-Barry
barry@zope.com (Barry A. Warsaw) writes:
But isn't that different? Doesn't this mean Qmail is violating the standards too?
How would it be in violation? qmail-smtpd, the program which receives mail via SMTP, didn't originate the message. On the other hand, qmail-inject (/usr/sbin/sendmail equivalent) *does* add a missing Date, From, etc. to messages.
"JRM" == Jason R Mastaler <jason-list-mailman-developers@mastaler.com> writes:
>> But isn't that different? Doesn't this mean Qmail is violating
>> the standards too?
JRM> How would it be in violation? qmail-smtpd, the program which
JRM> receives mail via SMTP, didn't originate the message. On the
JRM> other hand, qmail-inject (/usr/sbin/sendmail equivalent)
JRM> *does* add a missing Date, From, etc. to messages.
So if a process handed qmail-smtpd a message without a Date: header, and it sends the message on to some remote smtpd without adding the Date: header, is that legal? I guess you'd say because Qmail wasn't the originator of the message, it would be, but the remote smtpd would be within its rights to reject it.
Or am I being dense?
-Barry
barry@zope.com (Barry A. Warsaw) writes:
So if a process handed qmail-smtpd a message without a Date: header, and it sends the message on to some remote smtpd without adding the Date: header, is that legal? I guess you'd say because Qmail wasn't the originator of the message, it would be
Playing the qmail advocate, I'd probably say this, yes.
but the remote smtpd would be within its rights to reject it.
Perhaps, but I still don't think this makes qmail's behavior illegal. Given a rejected message, the trail of guilt would lead back to the originating program, where the problem should be corrected.
Jason R. Mastaler <jason-list-mailman-developers@mastaler.com> wrote:
Perhaps, but I still don't think this makes qmail's behavior illegal. Given a rejected message, the trail of guilt would lead back to the originating program, where the problem should be corrected.
Yes, precisely. And that's why I think it's wrong of MTAs to try to fix malformatted messages: This obscures such message formatting problems in potentially-hard-to-debug ways.
Greetings, Norbert.
-- A founder of the http://DotGNU.org project and Steering Committee member Norbert Bollow, Weidlistr.18, CH-8624 Gruet (near Zurich, Switzerland) Tel +41 1 972 20 59 Fax +41 1 972 20 69 http://thinkcoach.com List hosting with GNU Mailman on your own domain name http://cisto.com
Chuq Von Rospach <chuqui@plaidworks.com> writes:
I'm curious, and in honesty, haven't looked. Is qmail RFC compliant here?
Yes.
Or is it a case of doing what the authors feel is right, not what the standards tell them to?
qmail has only one author, D. J. Bernstein, and all his software is extremely standards conscious. In fact, this explains why qmail doesn't add a missing Date header for messages coming in via SMTP. qmail didn't create the message, Mailman did, and therefore it's Mailman's job as originator to add a Date line.
Other MTAs which add missing required headers are essentially just covering for non-compliant programs. djb isn't that kind I guess <wink>.
"JRM" == Jason R Mastaler <jason-list-mailman-developers@mastaler.com> writes:
JRM> Other MTAs which add missing required headers are essentially
JRM> just covering for non-compliant programs. djb isn't that
JRM> kind I guess <wink>.
The only argument I'd make is that the standards are really geared toward cooperation of alien systems, i.e. two unrelated processes that need to exchange mail messages. In Mailman's case, an argument can be made that it and its MTA are working in tandem to perform a function. They know a lot about each other, communicate over a semi-private channel (i.e. localhost:25), are controlled and configured by the same entities, etc.
So again, I have no problem with Qmail rejecting messages from the big bad world that are ill-formed, but it could be more cooperative with Mailman. I guess Qmail rejects ill-formed messages posted from local MUAs too.
OTOH, I can't and won't really argue this point much, because I also feel that Mailman should comply with the appropriate standards, so adding the Date: header for internally generated messages is really the right thing to do anyway.
-Barry
barry@zope.com (Barry A. Warsaw) writes:
The only argument I'd make is that the standards are really geared toward cooperation of alien systems, i.e. two unrelated processes that need to exchange mail messages. In Mailman's case, an argument can be made that it and its MTA are working in tandem to perform a function.
That would be a practical and reasonable argument. Some criticize djb for interpreting the RFCs too literally.
OTOH, I can't and won't really argue this point much, because I also feel that Mailman should comply with the appropriate standards, so adding the Date: header for internally generated messages is really the right thing to do anyway.
Indeed.
Jason R. Mastaler <jason-list-mailman-developers@mastaler.com> wrote:
barry@zope.com (Barry A. Warsaw) writes:
The only argument I'd make is that the standards are really geared toward cooperation of alien systems, i.e. two unrelated processes that need to exchange mail messages. In Mailman's case, an argument can be made that it and its MTA are working in tandem to perform a function.
That would be a practical and reasonable argument.
Yes. This means that if you're maintaining both the MTA and a MLM (mailing list manager) program which gives messages to it, then you can set things up so that the MLM expects the MTA to munge the messages in some ways that are not specified in the RFCs.
However, given that the GNU Mailman project maintains only an MLM and no MTA, and this MLM communicates with the MTA through an interface that is specified in an internet standard, there is IMO no doubt that it's Mailman's responsibility to comply with this standard.
Barry A. Warsaw <barry@zope.com> wrote:
"JRM" == Jason R Mastaler <jason-list-mailman-developers@mastaler.com> writes:
JRM> Then shouldn't you also have Mailman add a Message-ID header
JRM> as well? SMTP on qmail doesn't add that either.
I thought about the same thing, and /almost/ added it, but then I checked RFC 2822. While it requires one Date: header, Message-ID: is actually /not/ required, although it "SHOULD be present". I took that as license to be as conservative as possible. :)
The term "SHOULD" has a well-defined meaning in RFCs... RFC2822 explicitly invokes RFC2119, where this, and related terms are defined as follows:
MUST This word, or the terms "REQUIRED" or "SHALL", mean that the definition is an absolute requirement of the specification.
MUST NOT This phrase, or the phrase "SHALL NOT", mean that the definition is an absolute prohibition of the specification.
SHOULD This word, or the adjective "RECOMMENDED", mean that there may exist valid reasons in particular circumstances to ignore a particular item, but the full implications must be understood and carefully weighed before choosing a different course.
[..]
In other words, given that the Message-ID: header is a SHOULD, if you want to choose a different course, you (as maintainer of Mailman) have a responsibility to understand and carefully consider the full implications of not adding a Message-ID: header to messages that are _originated_ by Mailman.
There are several such implications, in areas like automated filtering of duplicate messages, debugging mail problems (some MTAs log the contents of the Message-ID: header), and spam filters (I know of at least one ISP where all messages without Message-ID: are considered to be probably spam).
I think it's probably much less work to just add the header than to "understand" and "carefully weigh" all these implications.
Greetings, Norbert.
-- A founder of the http://DotGNU.org project and Steering Committee member Norbert Bollow, Weidlistr.18, CH-8624 Gruet (near Zurich, Switzerland) Tel +41 1 972 20 59 Fax +41 1 972 20 69 http://thinkcoach.com List hosting with GNU Mailman on your own domain name http://cisto.com
Norbert Bollow <nb@thinkcoach.com> writes:
In other words, given that the Message-ID: header is a SHOULD, if you want to choose a different course, you (as maintainer of Mailman) have a responsibility to understand and carefully consider the full implications of not adding a Message-ID: header to messages that are _originated_ by Mailman.
[...]
I think it's probably much less work to just add the header than to "understand" and "carefully weigh" all these implications.
The attached patch (for MM 2.0.x) adds a Message-ID header to messages which lack one. Utils.make_msgid() is just a Python 1.x compatible rendition of the same function that is part of the latest email package.
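The attachment itself is not reproduced in this archive. For readers on a current Python, the same idea can be sketched with the standard library's email.utils.make_msgid(), which is the email-package function the message refers to. The wrapper name and its domain parameter below are hypothetical and are not taken from the patch.

```python
from email.message import Message
from email.utils import make_msgid


def ensure_message_id(msg, domain=None):
    """Add a Message-ID: header to `msg` only if it lacks one.

    `domain` is an illustrative parameter; when None, make_msgid()
    falls back to the local host name.
    """
    if msg["Message-ID"] is None:
        msg["Message-ID"] = make_msgid(domain=domain)
    return msg
```

As with the Date: header, an existing Message-ID: is left alone, so only messages that lack one are changed.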
barry@zope.com (Barry A. Warsaw) writes:
I thought about the same thing, and /almost/ added it, but then I checked RFC 2822. While it requires one Date: header, Message-ID: is actually /not/ required, although it "SHOULD be present". I took that as license to be as conservative as possible. :)
Works for me. I plan to upgrade to 2.1 as soon as it's released anyway.
Apr 02 06:48:03 2002 qrunner(12168): os.unlink(root+'.db')
Apr 02 06:48:03 2002 qrunner(12168): OSError : [Errno 2] No such file or
See Barry's recent response to my posting about 'qrunner on solaris' -- he's got a patch, was looking for testers...
- Rob
-- Rob Ellis <rob@web.ca> System Administrator, Web Networks
On Tue, Apr 02, 2002 at 01:39:06PM -0800, Chuq Von Rospach wrote:
I've just identified a pretty bad bug in mailman 2.0.x qrunner. It can cause messages to get lost, so, I almost hate to say this, Barry, but it might be time for a quick 2.0.9 patch. Given the changes to queuing in 2.1, I think this bug isn't relevant to the 2.1 tree.
If you're hit by the bug, you'll see occasional reports in the error log like:
Just to confirm that you're not crazy, I've seen this on sf.net too.
Marc
Microsoft is to operating systems & security .... .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | Finger marc_f@merlins.org for PGP key
"CVR" == Chuq Von Rospach <chuqui@plaidworks.com> writes:
CVR> I've just identified a pretty bad bug in mailman 2.0.x
CVR> qrunner. It can cause messages to get lost, so, I almost hate
CVR> to say this, Barry, but it might be time for a quick 2.0.9
CVR> patch. Given the changes to queuing in 2.1, I think this bug
CVR> isn't relevant to the 2.1 tree.
You know, I think I just fixed this on Friday, although I didn't get a chance to check everything into cvs before the holiday weekend. I definitely didn't get a chance to test it.
It is a valid bug, and it does warrant a 2.0.9 patch. The basic bug is caused by a disagreement on the order of .db and .msg file writing between qrunner and Message.Enqueue(), as Chuq rightly observes. I think the fix is simpler than what Chuq outlines, though.
Message.Enqueue() breaks the race by writing the .db file before it writes the .msg file, but qrunner's logic is backwards! It ignores the .msg files but it should be ignoring the .db files, since they're written first.
The fix is to qrunner, which should ignore .db files, triggering only on .msg files. If it finds a .msg file without a corresponding .db file, then it should unlink the orphaned .msg file. The final piece of the puzzle is that Message.Enqueue() should write the .msg file atomically, meaning, write it to a tmp file and use rename() to move it into place atomically.
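The two halves of the fix described above can be sketched as follows. This is an illustration in current Python, not the code that was checked into cvs: the function names, the ".tmp" suffix, and the return values are all assumptions made for the example.

```python
import os


def enqueue_msg_atomically(qdir, basename, text):
    """Writer side: make the .msg file appear in one step.

    The data goes to a temporary name first and is then renamed, so a
    reader never sees a half-written .msg file. The .db file is assumed
    to have been written before this is called.
    """
    final = os.path.join(qdir, basename + ".msg")
    tmp = final + ".tmp"  # hypothetical temp name; ignored by scan_queue()
    with open(tmp, "w") as fp:
        fp.write(text)
        fp.flush()
        os.fsync(fp.fileno())
    os.rename(tmp, final)  # atomic when both names are on the same filesystem
    return final


def scan_queue(qdir):
    """Reader side: trigger only on .msg files, never on .db files.

    Returns (ready, orphans): basenames that have both files, and
    basenames whose .msg has no matching .db.
    """
    ready, orphans = [], []
    for name in sorted(os.listdir(qdir)):
        root, ext = os.path.splitext(name)
        if ext != ".msg":
            continue  # skip .db files and temp files still being written
        if os.path.exists(os.path.join(qdir, root + ".db")):
            ready.append(root)
        else:
            orphans.append(root)
    return ready, orphans
```

The sketch only reports orphaned .msg files and leaves the unlinking to the caller. The point of the ordering is that a .db file without a .msg is a normal intermediate state, so the reader must never treat it as an orphan.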
CVR> Normally, I'd say "on to 2.1", but since this is a fairly
CVR> serious "silent data loss" bug AND the fix is trivial, I
CVR> think it might make sense to patch this and roll 2.0.9. At
CVR> the least, I think a patch needs to be approved by Barry and
CVR> released and made visible on the lists.org website.
Everything's checked into cvs now, and I'm about to do some testing. The more eyeballs on this code, the better, since it is so integral to the proper operation of the system. After some off-line stress testing, I'll foist this patch on python.org, watch the logs for a day or two, and then do the 2.0.9 release.
CVR> Please don't ask me how I found this. I'd have to kill
CVR> you. But this is one of the more obscure bugs I've ever
CVR> found... (grin)
Indeed. While I've seen the occasional reports of this for a while, it's nearly impossible to reproduce, and even with the traffic we see on python.org/zope.org, I've /never/ seen it there.
CVR> The window of opportunity to trigger it is immensely
CVR> small. You need two programs to be simultaneously updating
CVR> the same directory inode, and processing the same slot IN
CVR> that inode, at the exact same time. We're talking about a
CVR> latency of, as far as I can tell, 5-15 milliseconds every
CVR> time post writes a message into qfile, but only if qrunner is
CVR> actively processing. It looks like Barry took care (from
CVR> reading the source) to avoid this kind of situation -- but
CVR> didn't quite lock the window closed.
Unless I still haven't recovered from Maryland's glorious and long-awaited victory last night[*], I'm surprised this one snuck past us for so long. It jumped right out at me when I reviewed the code again. Sigh.
Fearing-the-turtle-ly y'rs, -Barry
[*] NCAA (college) men's basketball national champs.
Actually, I think I've seen two of these on my system as well, which I'm sure isn't nearly as busy as yours, Chuq, but I've been known on occasion to have quite a pileup!
Bob
I've just identified a pretty bad bug in mailman 2.0.x qrunner. It can cause messages to get lost, so, I almost hate to say this, Barry, but it might be time for a quick 2.0.9 patch. Given the changes to queuing in 2.1, I think this bug isn't relevant to the 2.1 tree.
participants (11)
- barry@zope.com
- Ben Gertzfield
- Bill Bradford
- Bob Puff@NLE
- Chuq Von Rospach
- Daniel Buchmann
- Jason R. Mastaler
- Marc MERLIN
- Norbert Bollow
- Phil Barnett
- Rob Ellis