Re: [Mailman-Developers] problem with view other subscriptions..

[Matt Davis]
The above strangeness was probably caused by a buglet in the current CVS LockFile.py; I believe it gets the order of unlink() calls wrong.
As I see it, it is more important for the lock file to never have a link count that is neither 0 or 2 than it is to make sure there are no tempfile turds. This implies that the real lock file should be unlink()ed before the tempfile, and not the other way around. Here's a (untested) patch (which also touches on some other issues I noticed while I was at it :):
I noticed another apparent inconsistency in LockFile.py, too: The comment at the start of __break() seems to imply that calling __touch() will totally remove the race condition, while I believe all it does is make the race condition a little less likely.
However, I don't think that neither the order of unlink calls nor any race condition is responsible for the problem you're having -- in fact, I believe everyone running current CVS Mailman will have the exact same problem. Here's another (untested) patch which tries to address the problem (these patches are not supposed to depend on the above patch, BTW):
HTH, HAND,
Harald

On 18 May 2000, Harald Meland wrote:
Applied all patches & got this.. I'm not much of a programmer :\ Or I could be of more help...
Traceback (innermost last): File "/home/mailman/scripts/driver", line 89, in run_main main() File "/home/mailman/Mailman/Cgi/handle_opts.py", line 78, in main process_form(mlist, user, doc) File "/home/mailman/Mailman/Cgi/handle_opts.py", line 168, in process_form from_mlist=mlist)) File "/home/mailman/Mailman/Utils.py", line 433, in map_maillists if mlist and mlist.internal_name() == i: NameError: mlist
This is the code from around there.
from Mailman import MailList
if names == None:
names = list_names()
got = []
for i in names:
if verbose:
print i,
433> if mlist and mlist.internal_name() == i: 434> l = mlist else: MailList.MailList(i, lock=lock) if verbose: print got.append(apply(func, (l,))) if verbose: print got[-1] if not (mlist and mlist.internal_name() == i): l.Unlock() del l return got
-- Matt Davis - ICQ# 934680 http://dogpound.vnet.net/
Any nitwit can understand computers. Many do. - Ted Nelson

Hi Harald,
I've finally gotten around to looking at your LockFile.py changes. Note that I checked in a different change to fix the actual problem with "view other subscriptions", but I still want to comment on your diffs.
HM> The above strangeness was probably caused by a buglet in the
HM> current CVS LockFile.py; I believe it gets the order of
HM> unlink() calls wrong.
HM> As I see it, it is more important for the lock file to never
HM> have a link count that is neither 0 or 2 than it is to make
HM> sure there are no tempfile turds. This implies that the real
HM> lock file should be unlink()ed before the tempfile, and not
HM> the other way around. Here's a (untested) patch (which also
HM> touches on some other issues I noticed while I was at it :):
Thinking about this more, I disagree. I think it is perfectly fine to expect the linkcount to be 0, 1, or 2, but never greater than 2. Let's look at each in detail. Remember we're talking about the link count of the shared global file, not the process specific tmp files.
Linkcount == 0: this just means that lockfile doesn't exist, so we don't actually ever get here (we'd get an ENOENT instead).
Linkcount == 1: One way this can happen is due to a race condition between proc1 which is in the middle of unlocking and proc2 which has just laid claim to the lock. Here's how this goes:
-- proc1 had a valid lock and is now in unlock(). It unlinks its
tmpfname, leaving lockfile's linkcount == 1
-- proc2 comes along and links lockfile to /its/ tmpfname. This
fails because lockfile already exists, but linkcount == 1. I
believe we can just ignore this case and loop around to attempt
acquisition again next time.
-- proc1 now unlinks lockfile, relinquishing the lock
-- proc2 loops and attempts to link lockfile -> tmpfname
This situation was broken but only because the big if/else in the OSError clause in lock() didn't take this race condition into account. I think the better fix is to just catch the linkcount==1 case, and essentially ignore it. To me, the more important invariant is that if the global lockfile exists, somebody else has the lock. Changing the order of unlinks breaks this invariant.
Linkcount == 2: Somebody's got a valid lock.
Linkcount > 2: Somebody's messing with us. I cannot see any situation where linkcount would be > 2 without "outside influences".
Okay, now to specific changes in your patch. Apologies for the ugly nature of the citations.
| + # HM: I don't understand why the tempfile should be
| + # unlinked in this situation. I think that all it w
| + # is confuse things, so I've commented it out.
| + ## os.unlink(self.__tmpfname)
You're right, this should not remove the tmpfname. This would be like only half giving up the lock! I've removed this unlink() call.
| - NotLockedError, unless optional `unconditional' is true.
| + NotLockedError, unless optional `unconditionally' is true.
Applied.
| - islocked = 0
| - try:
| - islocked = self.locked()
| - except OSError, e:
| - if e.errno <> errno.ENOENT: raise
| + islocked = self.locked()
Applied.
| - try:
| - winner = self.__read()
| - if winner:
| - os.unlink(winner)
| - except OSError, e:
| - if e.errno <> errno.ENOENT: raise
| + # Try finding the name of the old winner's temp file. We're ass
| + # the winner process has hung or died, so we should try cleaning
| + # their mess.
| + winner = self.__read()
Applied.
HM> I noticed another apparent inconsistency in LockFile.py,
HM> too: The comment at the start of __break() seems to imply
HM> that calling __touch() will totally remove the race
HM> condition, while I believe all it does is make the race
HM> condition a little less likely.
You're right of course. I've updated the comment, but not changed the code because I think we're adequately defensive in the face of this race condition. __break() doesn't acquire a lock and it doesn't matter if two processes attempt to break the same lock at the same time.
Thanks very much for looking at all this. I'm about to check in a new version of LockFile.py which also contains a slightly revised unit test.
Cheers, -Barry

On Thu, 1 Jun 2000, Barry A. Warsaw wrote:
Bingo :) Thanks you 2. For whatever you did its working just fine.
-- Matt Davis - ICQ# 934680 http://dogpound.vnet.net/
Spelling checkers at maximum! Fire!

On 18 May 2000, Harald Meland wrote:
Applied all patches & got this.. I'm not much of a programmer :\ Or I could be of more help...
Traceback (innermost last): File "/home/mailman/scripts/driver", line 89, in run_main main() File "/home/mailman/Mailman/Cgi/handle_opts.py", line 78, in main process_form(mlist, user, doc) File "/home/mailman/Mailman/Cgi/handle_opts.py", line 168, in process_form from_mlist=mlist)) File "/home/mailman/Mailman/Utils.py", line 433, in map_maillists if mlist and mlist.internal_name() == i: NameError: mlist
This is the code from around there.
from Mailman import MailList
if names == None:
names = list_names()
got = []
for i in names:
if verbose:
print i,
433> if mlist and mlist.internal_name() == i: 434> l = mlist else: MailList.MailList(i, lock=lock) if verbose: print got.append(apply(func, (l,))) if verbose: print got[-1] if not (mlist and mlist.internal_name() == i): l.Unlock() del l return got
-- Matt Davis - ICQ# 934680 http://dogpound.vnet.net/
Any nitwit can understand computers. Many do. - Ted Nelson

Hi Harald,
I've finally gotten around to looking at your LockFile.py changes. Note that I checked in a different change to fix the actual problem with "view other subscriptions", but I still want to comment on your diffs.
HM> The above strangeness was probably caused by a buglet in the
HM> current CVS LockFile.py; I believe it gets the order of
HM> unlink() calls wrong.
HM> As I see it, it is more important for the lock file to never
HM> have a link count that is neither 0 or 2 than it is to make
HM> sure there are no tempfile turds. This implies that the real
HM> lock file should be unlink()ed before the tempfile, and not
HM> the other way around. Here's a (untested) patch (which also
HM> touches on some other issues I noticed while I was at it :):
Thinking about this more, I disagree. I think it is perfectly fine to expect the linkcount to be 0, 1, or 2, but never greater than 2. Let's look at each in detail. Remember we're talking about the link count of the shared global file, not the process specific tmp files.
Linkcount == 0: this just means that lockfile doesn't exist, so we don't actually ever get here (we'd get an ENOENT instead).
Linkcount == 1: One way this can happen is due to a race condition between proc1 which is in the middle of unlocking and proc2 which has just laid claim to the lock. Here's how this goes:
-- proc1 had a valid lock and is now in unlock(). It unlinks its
tmpfname, leaving lockfile's linkcount == 1
-- proc2 comes along and links lockfile to /its/ tmpfname. This
fails because lockfile already exists, but linkcount == 1. I
believe we can just ignore this case and loop around to attempt
acquisition again next time.
-- proc1 now unlinks lockfile, relinquishing the lock
-- proc2 loops and attempts to link lockfile -> tmpfname
This situation was broken but only because the big if/else in the OSError clause in lock() didn't take this race condition into account. I think the better fix is to just catch the linkcount==1 case, and essentially ignore it. To me, the more important invariant is that if the global lockfile exists, somebody else has the lock. Changing the order of unlinks breaks this invariant.
Linkcount == 2: Somebody's got a valid lock.
Linkcount > 2: Somebody's messing with us. I cannot see any situation where linkcount would be > 2 without "outside influences".
Okay, now to specific changes in your patch. Apologies for the ugly nature of the citations.
| + # HM: I don't understand why the tempfile should be
| + # unlinked in this situation. I think that all it w
| + # is confuse things, so I've commented it out.
| + ## os.unlink(self.__tmpfname)
You're right, this should not remove the tmpfname. This would be like only half giving up the lock! I've removed this unlink() call.
| - NotLockedError, unless optional `unconditional' is true.
| + NotLockedError, unless optional `unconditionally' is true.
Applied.
| - islocked = 0
| - try:
| - islocked = self.locked()
| - except OSError, e:
| - if e.errno <> errno.ENOENT: raise
| + islocked = self.locked()
Applied.
| - try:
| - winner = self.__read()
| - if winner:
| - os.unlink(winner)
| - except OSError, e:
| - if e.errno <> errno.ENOENT: raise
| + # Try finding the name of the old winner's temp file. We're ass
| + # the winner process has hung or died, so we should try cleaning
| + # their mess.
| + winner = self.__read()
Applied.
HM> I noticed another apparent inconsistency in LockFile.py,
HM> too: The comment at the start of __break() seems to imply
HM> that calling __touch() will totally remove the race
HM> condition, while I believe all it does is make the race
HM> condition a little less likely.
You're right of course. I've updated the comment, but not changed the code because I think we're adequately defensive in the face of this race condition. __break() doesn't acquire a lock and it doesn't matter if two processes attempt to break the same lock at the same time.
Thanks very much for looking at all this. I'm about to check in a new version of LockFile.py which also contains a slightly revised unit test.
Cheers, -Barry

On Thu, 1 Jun 2000, Barry A. Warsaw wrote:
Bingo :) Thanks you 2. For whatever you did its working just fine.
-- Matt Davis - ICQ# 934680 http://dogpound.vnet.net/
Spelling checkers at maximum! Fire!
participants (3)
-
bwarsaw@python.org
-
Harald Meland
-
Matt Davis