[Mailman-Developers] race condition in locking ?

Thomas Wouters thomas@xs4all.net
Wed, 2 Feb 2000 14:00:42 +0100


I've finally found the time to do some serious testing of mailman (that is,
set up a mailman list and play with it ;) and I was hit by what looks like a
race condition in the locking. I was sending a lot of small mails to the
mailman-test list, using

echo "foobar foobar baz baz baz" | mail -s "foobar" mailman-test@xs4all.nl

and just manually hitting arrow-up, enter, arrow-up, enter, etc. I'd
subscribed a bouncing email address and wanted to see what would happen ;)

In any case, the machine mailman runs on is also used for some other
things; majordomo and our 'batched-smtp' product. Both services generate
very variable loads, majordomo because of some very large lists, and
batched-smtp due to customers with very large mailqueues dialing in and
starting a queuerun.

I dont send the mail from the same machine, so the mails I sent probably
arrived in one sendmail session, and then were executed by the queueing
sendmail right after another. Some of the messages (around 10/15 per
incident) I'd sent then bounced, most with the following traceback:

Traceback (innermost last):
  File "/usr/local/mailman/scripts/mailowner", line 69, in ?
    main()
  File "/usr/local/mailman/scripts/mailowner", line 48, in main
    mlist = MailList.MailList(sys.argv[1])
  File "/usr/local/mailman/Mailman/MailList.py", line 69, in __init__
    self.Load()
  File "/usr/local/mailman/Mailman/MailList.py", line 824, in Load
    self.Lock()
  File "/usr/local/mailman/Mailman/MailList.py", line 1255, in Lock
    self.__lock.lock()
  File "/usr/local/mailman/Mailman/LockFile.py", line 209, in lock
    os.link(self.__lockfile, self.__tmpfname)
OSError: [Errno 2] No such file or directory

And a couple with the following traceback:

Traceback (innermost last):
  File "/usr/local/mailman/scripts/post", line 86, in ?
    main()
  File "/usr/local/mailman/scripts/post", line 62, in main
    mlist.Post(msg)
  File "/usr/local/mailman/Mailman/MailList.py", line 1249, in Post  
    self.Save()
  File "/usr/local/mailman/Mailman/MailList.py", line 814, in Save 
    Utils.reraise()
  File "/usr/local/mailman/Mailman/MailList.py", line 811, in Save 
    os.link(fname, fname_last)
OSError: [Errno 17] File exists

The first traceback is probably because one process does a kickstart,
erasing the lockfile and starting a new one, while a lot of other processes
are trying to link() to it. The second traceback looks like a lock was
stolen from an active process, in other words a real locking failure.

The lock-logfile suggests some inefficient locking, too; long pauses between
unlocks and locks:

(18502) mailman-test.lock waiting for claim
(18492) mailman-test.lock unlocked
(18513) mailman-test.lock waiting for claim
(18584) mailman-test.lock waiting for claim
Feb 01 18:53:08 2000 (18622) mailman-test.lock kickstarted
(18425) mailman-test.lock waiting for claim
(18504) mailman-test.lock waiting for claim
(18456) mailman-test.lock waiting for claim
(18440) mailman-test.lock waiting for claim
(18622) mailman-test.lock waiting for claim
(18467) mailman-test.lock waiting for claim
(18584) mailman-test.lock waiting for claim
(18443) mailman-test.lock waiting for claim
(18500) mailman-test.lock waiting for claim
(18502) mailman-test.lock waiting for claim
(18513) mailman-test.lock waiting for claim
(18526) mailman-test.lock waiting for claim
(18504) mailman-test.lock waiting for claim
(18440) mailman-test.lock waiting for claim
(18425) mailman-test.lock waiting for claim
(18456) mailman-test.lock waiting for claim
(18622) mailman-test.lock waiting for claim
(18443) mailman-test.lock waiting for claim
(18513) mailman-test.lock waiting for claim
(18467) mailman-test.lock waiting for claim
(18584) mailman-test.lock waiting for claim
(18500) mailman-test.lock waiting for claim
(18502) mailman-test.lock waiting for claim
(18526) mailman-test.lock waiting for claim
(18440) mailman-test.lock waiting for claim
(18504) mailman-test.lock waiting for claim
(18425) mailman-test.lock waiting for claim
(18456) mailman-test.lock waiting for claim
(18622) mailman-test.lock waiting for claim
(18443) mailman-test.lock got the lock

(This is with the default sleep time of .25)          

This all led me to believe there was something wrong with locking, and I
spent the mornings' traintrip from home to work (1.5 hours) reading
LockFile.py and thinking about the locking. And I think the logic of
LockFile.lock() is flawed :P

lock() seems to rely on two things: os.stat() returning the right number of
links for the lockfile (which is a fairly safe assumption) and the contents
of the lockfile, which should give an indication as to who locked it and for
how long. But there is at least one situation possible where one process
thinks it has the lock, and another thinks the lock is faulty and does a
forced kickstart:

	process 1			process 2

	os.link(lockfile,tmpfile)
	os.stat(lockfile) == 2
					os.link(lockfile, tmpfile)
					os.stat(lockfile) == 3
	self.__write():
		open(tmpfile, "w")
					self.__read()
		write(os.getpid(), etc)
					__kickstart(force=1)
		return

I think I was able to reproduce this on my laptop by stressing it fairly
hard (load of 15 and rising.) The testcode only acquired and released locks,
with some small sleep inbetween, but I saw several double-locks, and one
OSError on the os.link(). 

I also think I see a way to fix this problem, by checking the mtime of the
file if the __read() fails. I think we can safely say that if the __read()
fails AND the mtime is recent, someone else is either assuming they have the
lock, or is kickstart()ing, in both of which cases we should step back and
start over.

Comments ? Am I trespassing on anyones' work here ? Is this part of mailman
still being looked at ?

-- 
Thomas Wouters <thomas@xs4all.net>

Hi! I'm a .signature virus! copy me into your .signature file to help me spread!