[Mailman-Developers] problem with view other subscriptions..

Barry A. Warsaw bwarsaw@python.org
Thu, 1 Jun 2000 17:49:47 -0400 (EDT)


Hi Harald,

I've finally gotten around to looking at your LockFile.py changes.
Note that I checked in a different change to fix the actual problem
with "view other subscriptions", but I still want to comment on your
diffs.

    HM>  The above strangeness was probably caused by a buglet in the
    HM> current CVS LockFile.py; I believe it gets the order of
    HM> unlink() calls wrong.

    HM> As I see it, it is more important for the lock file to never
    HM> have a link count that is neither 0 or 2 than it is to make
    HM> sure there are no tempfile turds.  This implies that the real
    HM> lock file should be unlink()ed before the tempfile, and not
    HM> the other way around.  Here's a (untested) patch (which also
    HM> touches on some other issues I noticed while I was at it :):

Thinking about this more, I disagree.  I think it is perfectly fine to
expect the linkcount to be 0, 1, or 2, but never greater than 2.
Let's look at each in detail.  Remember we're talking about the link
count of the shared global file, not the process specific tmp files.

Linkcount == 0: this just means that lockfile doesn't exist, so we
don't actually ever get here (we'd get an ENOENT instead).

Linkcount == 1: One way this can happen is due to a race condition
between proc1 which is in the middle of unlocking and proc2 which has
just laid claim to the lock.  Here's how this goes:

    -- proc1 had a valid lock and is now in unlock().  It unlinks its
       tmpfname, leaving lockfile's linkcount == 1

    -- proc2 comes along and links lockfile to /its/ tmpfname.  This
       fails because lockfile already exists, but linkcount == 1.  I
       believe we can just ignore this case and loop around to attempt
       acquisition again next time.

    -- proc1 now unlinks lockfile, relinquishing the lock

    -- proc2 loops and attempts to link lockfile -> tmpfname

This situation was broken but only because the big if/else in the
OSError clause in lock() didn't take this race condition into account.
I think the better fix is to just catch the linkcount==1 case, and
essentially ignore it.  To me, the more important invariant is that if
the global lockfile exists, somebody else has the lock.  Changing the
order of unlinks breaks this invariant.

Linkcount == 2: Somebody's got a valid lock.

Linkcount > 2: Somebody's messing with us.  I cannot see any situation
where linkcount would be > 2 without "outside influences".

Okay, now to specific changes in your patch.  Apologies for the ugly
nature of the citations.

    | +                    # HM: I don't understand why the tempfile should be
    | +                    # unlinked in this situation.  I think that all it w
    | +                    # is confuse things, so I've commented it out.
    | +                    ## os.unlink(self.__tmpfname)

You're right, this should not remove the tmpfname.  This would be like
only half giving up the lock!  I've removed this unlink() call.

    | -        NotLockedError, unless optional `unconditional' is true.
    | +        NotLockedError, unless optional `unconditionally' is true.

Applied.

    | -        islocked = 0
    | -        try:
    | -            islocked = self.locked()
    | -        except OSError, e:
    | -            if e.errno <> errno.ENOENT: raise
    | +        islocked = self.locked()

Applied.

    | -        try:
    | -            winner = self.__read()
    | -            if winner:
    | -                os.unlink(winner)
    | -        except OSError, e:
    | -            if e.errno <> errno.ENOENT: raise
    | +        # Try finding the name of the old winner's temp file.  We're ass
    | +        # the winner process has hung or died, so we should try cleaning
    | +        # their mess.
    | +        winner = self.__read()

Applied.
    
        HM> I noticed another apparent inconsistency in LockFile.py,
        HM> too: The comment at the start of __break() seems to imply
        HM> that calling __touch() will totally remove the race
        HM> condition, while I believe all it does is make the race
        HM> condition a little less likely.

You're right of course.  I've updated the comment, but not changed the
code because I think we're adequately defensive in the face of this
race condition.  __break() doesn't acquire a lock and it doesn't
matter if two processes attempt to break the same lock at the same
time.

Thanks very much for looking at all this.  I'm about to check in a new
version of LockFile.py which also contains a slightly revised unit
test.

Cheers,
-Barry