Mailman 3 LockFile.py - Mailman-Developers

22 Apr 2000 · *can't*


      A couple of months ago I posted about a few bugs in the Mailman locking
mechanism, most notably that it was using the link() lock method the wrong
way 'round ;) but also some symtomps of this problem: failed locks, broken
locks, very long pauses between locks and unlocks, and a generally high load
on the machine while multiple mailmans were trying to get the lock. I could
trigger this behaviour by sending 10+ messages at the same time to the same
list.
I wrote a slightly better version, and Harald came up and showed me his ;)
Harald's LockFile.py (from his CVS tree) had the link() call the right way
'round, but had some other strange quirks that I disagreed with. So I kept
my version around, and rewrote it some more, using ideas from Haralds'
version, to make it more robust. The result is attached ;)
The 'textbook' example of NFS-compatible locking uses the link() call to
create a (hard) link between the lockfile and a private file (private to the
locking process) and determines wether or not the lock succeeded by
stat()ing the private file and checking that the number of links to the file
is 2. It's done this way because a lot of operations are not atomic, on NFS
filesystems, but the link()/stat() method is (or is supposed to be.)
The current mailman implementation is reversed, somehow. It makes a link
from the lockfile to a private file, and checks the number of links to the
lockfile. If the number of links is not 2, it assumes the lock failed, and
tries to read the contents. If the contents are invalid, or timed out (and
the pid hasn't changed since last time it was checked) the lock gets broken.
If a lock succeeds, the locker writes it's pid, it's private lockfile and
the release time into the file.
The problem is that if many mailmans are locking at the same time, the time
between the link() call, the stat() call and the unlink() call, can easily
be enough for several other processes to execute the link(), thus ending
with a lot of links. Worse, all locking processes use the exact same
sleep_interval, so this condition is likely to continue. Also, there is
enough delay between locking the file and filling it with data, for another
process to discover the file is empty, and remove it. Lastly, the cycle of
lock & test is fairly intensive, and the sleep time not high enough to give
other processes the chance to relinquish the lock.
Though I describe it in nasty terms, it isn't readily apparent in normal
situations :-) I have to stress the list quite a bit to detect these issues,
but the machine I run mailman on can be pretty loaded, and i did see these
effects in the occasional burst of natural traffic.
Harald's LockFile, from his CVS tree, is a lot better -- the locking works
fine, but there are other issues that are not that pretty: refresh() is
tricky, can lose the lock in fact. The same goes for steal(). And the
__try_steal() method is especially nasty: it rename()s its private lockfile
to the lockfile, overwriting any lockfile there, waits for 3*sleep_interval,
and checks to see if anyone else overwrote the lock.
My LockFile jumps through some hoops to do some things as safely as
possible, and it still fails a bit in some places, but it looks a lot safer
than the other two lockfiles. It's also a bit more load-friendly, in that it
only examines the lockfile every 10th (or thereabouts) iteration. This
should be more than enough for normal purposes, and it allows the lock-loop
to be as light as possible, allowing more time for the actual locking
process (if any) to finish its work and unlock ;) And the time the lock()
process sleeps is slightly randomized, to prevent perfectly in-stride
behaviour. (+/- .32 seconds)
I kept the old 'contents of lockfile' trick in place, eventhough I'm not
sure if it's really that useful: the whole problem with locking over NFS is
that you *can't* rely on the contents of a file. So I'm not convinced the
current behaviour (remove the lock if the contents is invalid) is the
correct one. I see how the contents are convenient, especially the timeout
time, but I'm thinking that when a file is empty or invalid, the timeout
should be taken as the MTIME of the file + the default locktime. This should
catch most stale files right away, and prevent a lot of race conditions in
locking/refreshing/stealing/breaking locks. I haven't implemented that yet,
though.
Attached is LockFile.py. I'm using it in production currently, and it seems
to work fine. I dont have code that actually uses steal(), but i tested it
by hand, and it seems to work ok too. There is only one issue with steal:
should it return NotLockedError some such when stealing fails, or return 0,
or keep on trying ?
--
Thomas Wouters <thomas@xs4all.net>
Hi! I'm a .signature virus! copy me into your .signature file to help me spread!

LockFile.py

Thomas Wouters

Ricardo Kustner

tags

participants (2)