[ mailman-Patches-1045656 ] Race in LockFile.py resulting in extraneous log entries.

Sun Oct 24 10:02:13 CEST 2004

Patches item #1045656, was opened at 2004-10-12 19:50
Message generated for change (Comment added) made by ppsys
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=300103&aid=1045656&group_id=103

Category: mail delivery
Group: Mailman 2.1
Status: Open
Resolution: None
Priority: 3
Submitted By: Brian Greenberg (grnbrg)
Assigned to: Tokio Kikuchi (tkikuchi)
Summary: Race in LockFile.py resulting in extraneous log entries.

Initial Comment:
There is a race condition (that does not affect correct
operation of Mailman) in .../Mailman/LockFile.py.  This
race results in entries to the "locks" log file
(usually in pairs) that are unnecessary and confusing.

If there is a process holding a lock for a file, and a
process waiting for that lock to be freed, the
following sequence can occur:

1)  The waiting process executes os.link().  Since the
running process still has the lock, this fails with EEXIST.
2)  The running process releases the lock, and removes
the lock file.
3)  The waiting process proceeds with it's checks, and
checks self.__linkcount() which returns -1 due to an
ENOENT error.  This throws an error into the "locks"
log file.
4)  The waiting process continues processing, checks
it's timeout, and then checks to see if the lock
lifetime has expired.  The lifetime check is based on a
call to self.__releasetime(), which returns -1, due to
an ENOENT error.  This results in the lockfile being
declared stale, and self.__break() is called.  This
then throws a second error into the "locks" log file.

The attached patch checks for these ENOENT errors,
immediately restarting the loop when they are detected.
 It does not affect operation is the lock file exists
and is held by another process.

Brian Greenberg.

----------------------------------------------------------------------

Comment By: Richard Barrett (ppsys)
Date: 2004-10-24 08:02

Message:
Logged In: YES 
user_id=75166

The comments in the LockFile.py header code make the point that the 
entire mechanism is to deal with NFS robustness/compatibilty issues. The 
extranenous logging that concerns grnbrg is appearing on Solaris 
(version and build unspecified) and attempts to reproduce it, that have 
thus far failed, appear to be on FreeBSD. I suspect that the majority of 
MM installs are on various versions and builds of Linux. These OSen will 
all have substantially different NFS implementations and major 
differences in their kernel that probably affect NFS implementation and 
operation. Because of these differences I have already encountered 
sufficient problems with interworking of Linux NFS clients with Solaris 
NFS servers to convince me that changing LockFile.py without sufficient 
testing in order to clear an essentially cosmetic problem on one host OS 
is a an unecessary risk to stability across the board. Just testing on one 
host is not good enough and reading code even less so; it is the 
dynamics of operation in live situations that are the issue. A test 
programme needs to at least nod in the direction of heterogenous NFS 
client/host configurations of various types being in used for validation 
and verification of any changes.


----------------------------------------------------------------------

Comment By: Brian Greenberg (grnbrg)
Date: 2004-10-22 18:53

Message:
Logged In: YES 
user_id=902583

I'm running on a Solaris box, which may or may not have any
impact on the issue.

I'm not sure how to force reproduction -- races are hard to
intentially duplicate.  I first noticed the problem under a
heavy load due to an admin "oops" -- 30,000+ news articles
being gatewayed into a test list.  :)  Once the oops was
fixed, the error logs continued to occasionally appear, and
rather than assume that everything was ok, I tracked down
the source of the problem.  I did this by desk-checking the
source, and identified the race by a bit of guesswork.  Once
these changes were made, the problem stopped.

The race occurs as follows:

1) A qrunner runs LockFile.lock() to request a particular lock.
2) At LockFile.py:256, this qrunner attempts to link it's
temp lock file to the global lockfile.  This link fails with
a EEXIST error, because another process currently holds the
lock.  The assumption is made at this point that the global
lockfile exists.  This (false) assumption continues through
to the end of the loop at LockFile.py:312.  In fact, the
global lockfile may cease to exist (because the qrunner that
has it calls unlock()) at any time.
3) The process holding the lock calls unlock().
4) The waiting process (which just got an EEXIST) goes
through it's various sanity checks, which include checking
the linkcount and mtime of the global lockfile, which now
does not exist.  Since it is assumed to exist, the -1
results returned make no sense, and are unconditionally logged.
5) At LockFile.py:303, the (incorrectly assumed) existing
lock is broken, and control goes back to the top of the loop
(after a quick __sleep())

As far as impact, the patch should have minimal side
effects.  I pulled the function calls out of the if
statements, so there is no change in the number of calls. 
Further, all the new code does is to check for the "-1"
return that indicates that the global lockfile no longer
exists, and immediately restarts the loop if that is the
case.  It might not be a bad idea to add a "self.__sleep()"
before the continue, but I didn't think it was needed.

Hope that adds enough detail to trace what I think is
happening, and verify that it won't break anything.

FWIW, I have not had any problems, nor any errors logged
since applying this locally.

----------------------------------------------------------------------

Comment By: Barry A. Warsaw (bwarsaw)
Date: 2004-10-22 12:06

Message:
Logged In: YES 
user_id=12800

I haven't had time to look at this in detail, but I'll just
include a general warning that LockFile.py is /very/
fragile.  It's also critical to the proper operation of
Mailman.  So any changes, even seemingly innocent onces,
must be very thoroughly tested.  My inclination would be to
reject any change that wasn't absolutely necessary, but I
won't veto this change if it is well-tested and fixes a
verified bug (tho' it sounds like tkikuchi can't reproduce it).

----------------------------------------------------------------------

Comment By: Tokio Kikuchi (tkikuchi)
Date: 2004-10-22 08:26

Message:
Logged In: YES 
user_id=67709

Will you be more specific how this race occurs? My
installation on FreeBSD doesn't looks like show this symptom
in locks log file.
I tested with two terminals invoking python -i bin/withlist
testlist and try to lock with m.Lock() in both the terminal.
One of term is forced to wait until the I type m.Unlock() in
the other term. I then looked into logs dir but nothing avail.


----------------------------------------------------------------------

Comment By: Brian Greenberg (grnbrg)
Date: 2004-10-15 14:22

Message:
Logged In: YES 
user_id=902583

This would fix this bug, as long as CLOCK_SLOP is large enough.

But I don't think it's the Right Thing, as you are returning
a result that says that the lockfile has a lifetime, when it
in fact simply no longer exists.  For a function whose
purpose is to report the mtime of the master lockfile, -1 is
the correct response if the file doesn't exist.  The code
that calls that function should check for and interpret that
condition.

----------------------------------------------------------------------

Comment By: SHIGENO Kazutaka (shigeno)
Date: 2004-10-15 00:11

Message:
Logged In: YES 
user_id=1138453

How about changing as follows about self.__releasetime()?

@@ -448,7 +448,7 @@
             return os.stat(self.__lockfile)[ST_MTIME]
         except OSError, e:
             if e.errno <> errno.ENOENT: raise
-            return -1
+            return time.time()

     def __linkcount(self):
         try:


----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=300103&aid=1045656&group_id=103