[ mailman-Patches-1045656 ] Race in LockFile.py resulting in extraneous log entries.
Patches item #1045656, was opened at 2004-10-12 19:50
Message generated for change (Comment added) made by tkikuchi
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=300103&aid=1045656&group_id=103

Category: mail delivery
Group: Mailman 2.1
Status: Open
Resolution: Postponed
Priority: 3
Submitted By: Brian Greenberg (grnbrg)
Assigned to: Tokio Kikuchi (tkikuchi)
Summary: Race in LockFile.py resulting in extraneous log entries.
Initial Comment:
There is a race condition in .../Mailman/LockFile.py that does not affect
the correct operation of Mailman, but results in entries in the "locks" log
file (usually in pairs) that are unnecessary and confusing.

If one process holds a lock and another process is waiting for that lock to
be freed, the following sequence can occur:

1) The waiting process executes os.link(). Since the running process still
holds the lock, this fails with EEXIST.

2) The running process releases the lock and removes the lock file.

3) The waiting process proceeds with its checks and calls
self.__linkcount(), which returns -1 due to an ENOENT error. This throws an
error into the "locks" log file.

4) The waiting process continues, checks its timeout, and then checks
whether the lock lifetime has expired. The lifetime check is based on a
call to self.__releasetime(), which also returns -1 due to an ENOENT error.
As a result the lock file is declared stale and self.__break() is called,
which throws a second error into the "locks" log file.

The attached patch checks for these ENOENT errors and immediately restarts
the loop when they are detected. It does not affect operation if the lock
file exists and is held by another process.

Brian Greenberg.

----------------------------------------------------------------------
Comment By: Tokio Kikuchi (tkikuchi)
Date: 2004-10-25 00:02

Message:
Logged In: YES
user_id=67709

OK. I found the log output in my Solaris installation, but I can't
reproduce the reported log pair using bin/withlist. It looks like this log
output occurs unpredictably, so the patch needs more checking before it can
be applied. I found:

  [unexpected linkcount: -1 / lifetime has expired, breaking] pair on Solaris
  [lifetime has expired, breaking] only on BSD/OS
  no locks log entries on FreeBSD

So I guess the behaviour is highly OS dependent.

----------------------------------------------------------------------

Comment By: Richard Barrett (ppsys)
Date: 2004-10-24 08:02

Message:
Logged In: YES
user_id=75166

The comments in the LockFile.py header make the point that the entire
mechanism exists to deal with NFS robustness/compatibility issues. The
extraneous logging that concerns grnbrg appears on Solaris (version and
build unspecified), and the attempts to reproduce it, which have so far
failed, appear to be on FreeBSD. I suspect that the majority of Mailman
installs are on various versions and builds of Linux. These OSes all have
substantially different NFS implementations and major kernel differences
that probably affect NFS implementation and operation. Because of these
differences, I have already encountered enough problems with Linux NFS
clients interworking with Solaris NFS servers to convince me that changing
LockFile.py without sufficient testing, in order to clear an essentially
cosmetic problem on one host OS, is an unnecessary risk to stability across
the board. Testing on just one host is not good enough, and reading code
even less so; it is the dynamics of operation in live situations that
matter. A test programme needs to at least nod in the direction of the
heterogeneous NFS client/host configurations in use before any changes can
be validated and verified.
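The difficulty reproducing the log pair on demand is unsurprising: the
window is only a few instructions wide. The interleaving can, however, be
replayed deterministically in a single process by performing the holder's
unlock step by hand between the waiter's failed os.link() and its follow-up
stat(). The sketch below is standalone demo code, not Mailman source; file
names are invented.

```python
import errno
import os
import tempfile

workdir = tempfile.mkdtemp()
global_lock = os.path.join(workdir, 'master.lock')
my_tmp = os.path.join(workdir, 'master.lock.waiter')

open(global_lock, 'w').close()   # the holder owns the lock
open(my_tmp, 'w').close()        # the waiter's per-process temp file

# Step 1: the waiter's os.link() fails with EEXIST.
try:
    os.link(my_tmp, global_lock)
    raised = None
except OSError as e:
    raised = e.errno
assert raised == errno.EEXIST

# Step 2: the holder releases the lock (unlock() removes the file).
os.unlink(global_lock)

# Step 3: the waiter's sanity checks now hit ENOENT -- the condition the
# unpatched code reports as "unexpected linkcount: -1".
try:
    os.stat(global_lock)
    vanished = False
except OSError as e:
    vanished = (e.errno == errno.ENOENT)
print('race window observed:', vanished)   # prints: race window observed: True
```

In the real system, step 2 happens in a different process at an
unpredictable moment, which is why the log entries appear only
occasionally and only under load.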
----------------------------------------------------------------------

Comment By: Brian Greenberg (grnbrg)
Date: 2004-10-22 18:53

Message:
Logged In: YES
user_id=902583

I'm running on a Solaris box, which may or may not have any impact on the
issue. I'm not sure how to force reproduction -- races are hard to
duplicate intentionally. I first noticed the problem under a heavy load due
to an admin "oops" -- 30,000+ news articles being gatewayed into a test
list. :) Once the oops was fixed, the error logs continued to appear
occasionally, and rather than assume that everything was OK, I tracked down
the source of the problem. I did this by desk-checking the source, and
identified the race by a bit of guesswork. Once these changes were made,
the problem stopped.

The race occurs as follows:

1) A qrunner runs LockFile.lock() to request a particular lock.

2) At LockFile.py:256, this qrunner attempts to link its temp lock file to
the global lockfile. The link fails with an EEXIST error because another
process currently holds the lock. The assumption is made at this point that
the global lockfile exists, and this (false) assumption persists through to
the end of the loop at LockFile.py:312. In fact, the global lockfile may
cease to exist at any time (because the qrunner that holds it calls
unlock()).

3) The process holding the lock calls unlock().

4) The waiting process (which just got an EEXIST) goes through its various
sanity checks, which include checking the link count and mtime of the
global lockfile, which now does not exist. Since the file is assumed to
exist, the -1 results returned make no sense and are unconditionally
logged.

5) At LockFile.py:303, the (incorrectly assumed) existing lock is broken,
and control goes back to the top of the loop (after a quick __sleep()).

As far as impact goes, the patch should have minimal side effects. I pulled
the function calls out of the if statements, so there is no change in the
number of calls.
Further, all the new code does is check for the "-1" return that indicates
the global lockfile no longer exists, and immediately restart the loop when
that is the case. It might not be a bad idea to add a self.__sleep() before
the continue, but I didn't think it was needed. I hope that adds enough
detail to trace what I think is happening, and to verify that it won't
break anything. FWIW, I have not had any problems, nor any errors logged,
since applying this locally.

----------------------------------------------------------------------

Comment By: Barry A. Warsaw (bwarsaw)
Date: 2004-10-22 12:06

Message:
Logged In: YES
user_id=12800

I haven't had time to look at this in detail, but I'll just include a
general warning that LockFile.py is /very/ fragile. It's also critical to
the proper operation of Mailman. So any changes, even seemingly innocent
ones, must be very thoroughly tested. My inclination would be to reject any
change that isn't absolutely necessary, but I won't veto this change if it
is well tested and fixes a verified bug (tho' it sounds like tkikuchi can't
reproduce it).

----------------------------------------------------------------------

Comment By: Tokio Kikuchi (tkikuchi)
Date: 2004-10-22 08:26

Message:
Logged In: YES
user_id=67709

Will you be more specific about how this race occurs? My installation on
FreeBSD doesn't seem to show this symptom in the locks log file. I tested
with two terminals, invoking

  python -i bin/withlist testlist

and trying to lock with m.Lock() in both terminals. One terminal is forced
to wait until I type m.Unlock() in the other. I then looked into the logs
dir, but nothing was there.

----------------------------------------------------------------------

Comment By: Brian Greenberg (grnbrg)
Date: 2004-10-15 14:22

Message:
Logged In: YES
user_id=902583

This would fix this bug, as long as CLOCK_SLOP is large enough.
But I don't think it's the Right Thing, because it returns a result saying
that the lockfile has a lifetime when it in fact simply no longer exists.
For a function whose purpose is to report the mtime of the master lockfile,
-1 is the correct response if the file doesn't exist. The code that calls
that function should check for and interpret that condition.

----------------------------------------------------------------------

Comment By: SHIGENO Kazutaka (shigeno)
Date: 2004-10-15 00:11

Message:
Logged In: YES
user_id=1138453

How about changing self.__releasetime() as follows?

@@ -448,7 +448,7 @@
             return os.stat(self.__lockfile)[ST_MTIME]
         except OSError, e:
             if e.errno <> errno.ENOENT: raise
-            return -1
+            return time.time()

     def __linkcount(self):
         try:

----------------------------------------------------------------------

You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=300103&aid=1045656&group_id=103
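As an aside on the two competing fixes discussed in this thread, their
behaviour in the race window can be contrasted schematically. This sketch
is not Mailman code: the CLOCK_SLOP value is illustrative, and
`lifetime_expired` is a hypothetical stand-in for the caller-side staleness
test (including the -1 check that grnbrg's approach requires).

```python
import time

CLOCK_SLOP = 10  # illustrative; the real constant lives in LockFile.py

def lifetime_expired(releasetime, now):
    # Schematic staleness test, with the caller-side guard grnbrg
    # advocates: the -1 sentinel means "lock file gone" and must never
    # be treated as an mtime.
    if releasetime < 0:
        return False
    return releasetime + CLOCK_SLOP < now

now = time.time()

# grnbrg's fix: __releasetime() returns -1; the caller special-cases it.
assert not lifetime_expired(-1, now)

# shigeno's fix: __releasetime() returns time.time(), so the lock can
# never look expired in the race window (no caller change needed).
assert not lifetime_expired(now, now)

# Unpatched behaviour: -1 fed straight into the arithmetic makes the lock
# look expired, triggering the spurious "breaking" log entry.
assert -1 + CLOCK_SLOP < now
```

Either way, the spurious "lifetime has expired, breaking" entry disappears;
the disagreement is over whether the sentinel should be interpreted by the
caller or hidden inside the accessor.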