[Mailman-Users] Big problems with stale lockfiles on large list...

Barry A. Warsaw barry at digicool.com
Tue May 1 22:40:42 CEST 2001


I'm CC'ing Mailman developers because I want to get some feedback from
the Unix hackers amongst us; apologies for duplicates...

    >>  We also experienced similar problems in the past with large
    >> lists. We found that if the admin is accessing the long and
    >> slow-loading members admin pages, and does not wait for the
    >> page to complete (e.g. clicks one of the admin links or hits
    >> refresh or stop) the lock will remain. As a temporary solution
    >> we have instructed our admins to wait for all pages to
    >> completely load, no stale locks since then. This is on a Cobalt
    >> RaQ4i (Redhat) with Apache 1.3x. Hope this helps,

>>>>> "GT" == Graham TerMarsch <mailman at howlingfrog.com> writes:

    GT> Is probably related to what we're experiencing, but I'm
    GT> finding that even just due to the volume of hits on the
    GT> "subscribe" page, that we're having this problem.  While
    GT> testing, I made sure that we had _no_ accesses to admin pages,
    GT> and that all of the "subscribe" hits that were coming through
    GT> waited for the complete response and didn't time out or have
    GT> the "stop" button pressed.  Even in this scenario, I still
    GT> ended up with stale locks lingering around.

    [...]
    GT> Would it be correct to say that if the CGI process dies for
    GT> some unforseen reason (e.g. Apache kills it off because the
    GT> user pressed the "stop" button or the HTTP connection timed
    GT> out), that the lock from that process gets left around as a
    GT> lingering lock?

After an all-night hacking jag, lots of google searches, a re-read of
Stevens Ch.10, and a short conversation with Guido, I believe I know
what's going on.  That's the good news.  The bad news is that I don't
think there's a perfect solution, but there probably is a better
solution.

FTR, here's my development environment: Apache 1.3.19, Python 2.1,
RH6.2-ish, kernel 2.2.18, NS 4.74, and the current Mailman 2.1 CVS
snapshot.

Here's what I did: I loaded up a list with 60000 subscribers, then
went to the members page.  It did indeed take a long time, but if I
let it run to completion, I got the page as expected and no stale
locks.

However, if I hit the stop button before the page finishes loading, I
can see that the CGI process continues to run for a while, and then it
may or may not clear the locks.  The page, of course, comes out
incomplete.  Since the locks are sometimes cleared and sometimes left
behind, it's pretty clear there are race conditions involved.

I did a bit of web searching and here's what I think a stock Apache
mod_cgi is supposed to do in this situation:

- The user hits the stop button on the browser.  The client closes
  the socket, which the server usually only recognizes some time
  later.

- The cgi script meanwhile is merrily doing some long calculations,
  and eventually it will try to write output.  It's at this point that
  it may be possible to detect that the socket has gone away.  What's
  interesting is that it appears that the cgi script actually writes
  output to the parent Apache process, which perhaps buffers the
  output before sending to the client.  In any event, it's the Apache
  parent process that gets a SIGPIPE indicating the socket it's
  writing to is closed.

- Apache catches the SIGPIPE, turns around and sends a SIGTERM to the
  cgi process.  Apache waits three seconds and if the cgi hasn't
  exited yet, it sends a SIGKILL.

The default behavior for SIGTERM is to terminate the process, so
whether it's the SIGTERM or the SIGKILL that does the dirty work, the
naive cgi script can get summarily killed without a chance to clean
up.
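
To make the escalation concrete, here's a rough sketch in Python of
the TERM-then-KILL sequence described above.  This is just an
illustration, not Apache's actual C code: the fork/kill scaffolding,
the SIG_IGN trick to simulate a cgi that doesn't exit in time, and
the exact timings are all made up for the demo.

import os
import signal
import time

# The parent plays the role of Apache; the child plays the role of a
# cgi that won't exit within the grace period.
pid = os.fork()
if pid == 0:
    # Ignore SIGTERM to simulate a cgi that's still busy 3 seconds
    # after the polite request.
    signal.signal(signal.SIGTERM, signal.SIG_IGN)
    time.sleep(60)
    os._exit(0)

time.sleep(1)
os.kill(pid, signal.SIGTERM)                # first, the polite request
time.sleep(3)                               # the three-second grace period
if os.waitpid(pid, os.WNOHANG) == (0, 0):   # still running?
    os.kill(pid, signal.SIGKILL)            # the one it can't catch
    os.waitpid(pid, 0)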

Enter Python.  Python installs a few of its own signal handlers, most
notably for SIGINT, which it maps to the KeyboardInterrupt exception.
Normally, though, SIGTERM isn't caught by Python, so again, the naive
Python cgi script will summarily die when it receives a SIGTERM.
Under mod_cgi, even catching SIGTERM may not help if more than 3
seconds of processing occurs (I don't know if this is clock time or
cpu time).  Again, that's because Apache turns around and SIGKILLs the
cgi, and that signal can't be caught (remember, this is what bit us a
while back w.r.t. Postfix filter programs).
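
You can verify the out-of-the-box dispositions with a tiny snippet
(just an illustration, nothing Mailman-specific):

import signal

# SIGINT has Python's own handler installed (it raises
# KeyboardInterrupt); SIGTERM is left at the system default, which
# terminates the process with no chance to clean up.
print(signal.getsignal(signal.SIGINT))    # Python's default_int_handler
print(signal.getsignal(signal.SIGTERM))   # SIG_DFL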

So maybe the answer is to install a Python-level SIGTERM handler that
will release the lock.  To understand this, you need to know how
Python handles signals.  When you set a signal handler in Python, the
interpreter installs its own C-level handler, which simply sets a flag
and returns.  That's because Python only runs Python-level signal
handlers at bytecode boundaries.  So your Python signal handler can
get run at some arbitrary point in the future, and likely /not/ at the
moment the signal actually occurred.
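
Here's a small sketch of that deferred dispatch (SIGALRM and the loop
bound are just stand-ins for the demo):

import signal
import time

def handler(signum, frame):
    # Runs at a bytecode boundary some time after SIGALRM was actually
    # delivered, not inside the C-level signal handler itself.
    print("Python-level handler ran at +%.2fs" % (time.time() - t0))

signal.signal(signal.SIGALRM, handler)

t0 = time.time()
signal.alarm(1)              # ask the kernel for a SIGALRM in ~1 second
i = 0
while i < 20000000:          # a plain Python loop: plenty of bytecode
    i = i + 1                # boundaries where the handler can be run
print("loop finished at +%.2fs" % (time.time() - t0))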

All this is very dodgy in the face of try/finally clauses, which is
how CGI scripts like admin.py ensure that the mailing list is saved
and unlocked when an exception occurs.  For example, my initial
solution was to install a signal handler that raised an exception,
thinking that this would trigger the finally clause and do the normal
save & unlock.  But because of race conditions and interactions with
the Python interpreter loop, I saw cases where the exception was
raised but the finally clause never executed.  Guido confirmed that
this isn't a safe approach.
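
For concreteness, here's a self-contained sketch of that rejected
approach.  The names (SigTermAbort, the fake lock messages) and the
fork/kill scaffolding are purely illustrative, and the race itself is
rare and timing-dependent, so a toy run like this will almost always
appear to work (the child exits with a traceback, which is expected):

import os
import signal
import sys
import time

class SigTermAbort(Exception):
    pass

def handler(signum, frame):
    # The rejected idea: raise an exception and hope the try/finally
    # below unwinds normally.
    raise SigTermAbort

pid = os.fork()
if pid == 0:                        # child: stands in for the cgi
    signal.signal(signal.SIGTERM, handler)
    sys.stderr.write("child: locked\n")
    try:
        time.sleep(30)              # the long page generation
        sys.stderr.write("child: saved\n")
    finally:
        # Usually reached via the exception, but not guaranteed if the
        # signal lands at exactly the wrong spot in the interpreter.
        sys.stderr.write("child: unlocked\n")
    os._exit(0)

time.sleep(1)
os.kill(pid, signal.SIGTERM)        # parent: stands in for Apache
os.waitpid(pid, 0)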

So my next approach is to write a very minimal signal handler that
only unlocks the list, and to install it on SIGTERM.  The trick is to
make the signal handler a nested function of main() in admin.py, and
to pass the MailList object into it using the default-argument trick.

This seems to work, in that the locks appear to be cleared in the
cases where they were left lying around before.  But because of all
the race conditions, I can't be 100% sure.

If you've read this far, the implication is that if the user hits the
stop button, Mailman will in essence abort any changes to list
configuration that this invocation may have made.  Alternatively, we
could try to save & unlock in the signal handler, but that raises the
possibility of race conditions again.  Also, it makes sense to move
the save of the list data into the try: part of the clause and only do
the unlocking in the finally.  That way, the finally clause and the
SIGTERM handler have the same semantics, and the list will get
unlocked in the face of either an exception or a signal.  But the list
database will only get saved on successful completion of the task.  I
can live with those semantics (I think ;).

So the code in admin.py looks something like:

def main():
    ...
    mlist = MailList.MailList(listname, lock=0)
    ...

    def sigterm_handler(signum, frame, mlist=mlist):
        # Minimal handler: just drop the lock; don't try to save here.
        mlist.Unlock()

    mlist.Lock()
    try:
        signal.signal(signal.SIGTERM, sigterm_handler)
        ...
        mlist.Save()
    finally:
        mlist.Unlock()

There are still opportunities to get nailed, but I think this works
about as well as is possible.  In my testing I've been able to clear
the locks when the stop button is pushed.  In the rare event that the
list's config.db file gets corrupted (i.e. the handler gets called
in the middle of mlist.Save()), we've always got the config.db.last
file to fall back on.

The other good news is that I think only admindb.py, admin.py, and
confirm.py need to be modified, since those are the only scripts that
write to list attributes.  I've been pretty diligent in Mailman 2.1 to
only lock the mailing lists when necessary, so if the scripts are
read-only, they don't acquire the lock (and get the state of the
database as of the time of the initial Load()).

Phew!

-Barry



