Big problems with stale lockfiles on large list...
![](https://secure.gravatar.com/avatar/24b55bf954fdd812040daa1db3ff9336.jpg?s=120&d=mm&r=g)
Running Mailman 2.0.1 with Python 1.5.2 on a RedHat 6.2 machine, along with Apache 1.3.14 and Sendmail 8.11.0, and I'm having some serious grief with stale lockfiles on one of our lists. The list contains ~60k addresses and sees constant traffic to the WWW administration pages. It's not high volume for sending messages, though (only one or two a day), as it's a broadcast/announce list.
I'm finding, though, that the WWW processes are regularly creating stale locks that sit around and hold everything up. For fun, I tried using "ab" (ApacheBench) to fire five concurrent "subscribe" requests at our box, and saw that it regularly ended up creating stale locks and blocking out the rest of the system. Worse yet, I'm not seeing any useful information in the "logs/errors" file, nor am I getting anything useful in "logs/locks" (other than seeing that some process got a lock, did its thing, and then shut itself down _WITHOUT_ releasing the lock).
I'm presuming that what I'm running into is more of a config/tuning issue than a serious bug, as I'm sure I can't be the only person running lists this large.
Any and all information, tips, pointers, or suggestions are welcome.
-- Graham TerMarsch
![](https://secure.gravatar.com/avatar/b26a573943ef4f62e9f6c2015eeba9ac.jpg?s=120&d=mm&r=g)
Graham,
We also experienced similar problems in the past with large lists. We found that if the admin is accessing the long and slow-loading members admin pages, and does not wait for the page to complete (e.g. clicks one of the admin links or hits refresh or stop), the lock will remain. As a temporary solution we have instructed our admins to wait for all pages to completely load; no stale locks since then. This is on a Cobalt RaQ4i (RedHat) with Apache 1.3.x. Hope this helps,
Gergo
![](https://secure.gravatar.com/avatar/24b55bf954fdd812040daa1db3ff9336.jpg?s=120&d=mm&r=g)
On Friday 27 April 2001 13:46, Gergo Soros wrote:
It's probably related to what we're experiencing, but I'm finding that even just the volume of hits on the "subscribe" page causes this problem. While testing, I made sure that we had _no_ accesses to admin pages, and that all of the "subscribe" hits that came through waited for the complete response and didn't time out or have the "stop" button pressed. Even in this scenario, I still ended up with stale locks lingering around.
Would it be correct to say that if the CGI process dies for some unforeseen reason (e.g. Apache kills it off because the user pressed the "stop" button or the HTTP connection timed out), the lock from that process gets left around as a lingering lock?
-- Graham TerMarsch
![](https://secure.gravatar.com/avatar/cb6b2a19d7ea20358a4c4f0332afc3ef.jpg?s=120&d=mm&r=g)
I'm CC'ing the Mailman developers because I want to get some feedback from the Unix hackers amongst us; apologies for duplicates...
>> We also experienced similar problems in the past with large
>> lists. We found that if the admin is accessing the long and
>> slow-loading members admin pages, and does not wait for the
>> page to complete (e.g. clicks one of the admin links or hits
>> refresh or stop) the lock will remain. As a temporary solution
>> we have instructed our admins to wait for all pages to
>> completely load, no stale locks since then. This is on a Cobalt
>> RaQ4i (Redhat) with Apache 1.3x. Hope this helps,
"GT" == Graham TerMarsch <mailman@howlingfrog.com> writes:
GT> It's probably related to what we're experiencing, but I'm
GT> finding that even just due to the volume of hits on the
GT> "subscribe" page, that we're having this problem. While
GT> testing, I made sure that we had _no_ accesses to admin pages,
GT> and that all of the "subscribe" hits that were coming through
GT> waited for the complete response and didn't time out or have
GT> the "stop" button pressed. Even in this scenario, I still
GT> ended up with stale locks lingering around.
[...]
GT> Would it be correct to say that if the CGI process dies for
GT> some unforeseen reason (e.g. Apache kills it off because the
GT> user pressed the "stop" button or the HTTP connection timed
GT> out), that the lock from that process gets left around as a
GT> lingering lock?
After an all night hacking jag, lots of google searches, a re-read of Stevens Ch.10, and a short conversation with Guido, I believe I know what's going on. That's the good news. The bad news is that I don't think there's a perfect solution, but there probably is a better solution.
FTR, here's my development environment: Apache 1.3.19, Python 2.1, RH6.2-ish, kernel 2.2.18, NS 4.74, and the current Mailman 2.1 CVS snapshot.
Here's what I did: I loaded up a list with 60,000 subscribers, then went to the members page. It did indeed take a long time, and if I let it run to completion, I got the page as expected and no locks.
However, if I hit the stop button before the page is finished loading, I can see that the CGI process continues to run for a while and then it may or may not clear the locks. The page is not complete. Since sometimes the locks are cleared and sometimes they're left, it's pretty clear there are race conditions involved.
I did a bit of web searching and here's what I think a stock Apache mod_cgi is supposed to do in this situation:
The user hits the stop button on the browser. The client will close the socket, which is usually eventually recognized by the server.
The cgi script meanwhile is merrily doing some long calculations, and eventually it will try to write output. It's at this point that it may be possible to detect that the socket has gone away. What's interesting is that it appears that the cgi script actually writes output to the parent Apache process, which perhaps buffers the output before sending to the client. In any event, it's the Apache parent process that gets a SIGPIPE indicating the socket it's writing to is closed.
Apache catches the SIGPIPE, turns around and sends a SIGTERM to the cgi process. Apache waits three seconds and if the cgi hasn't exited yet, it sends a SIGKILL.
The default behavior for SIGTERM is to terminate the process, so whether it's the SIGTERM or SIGKILL that does the dirty work, in the end, the naive cgi script can get summarily killed without a chance at clean up.
Enter Python. Python installs a few of its own signal handlers, most notably for SIGINT which it maps to the KeyboardInterrupt exception. Normally though, SIGTERM isn't caught by Python, so again, the naive Python cgi script will summarily die when receiving a SIGTERM. Under mod_cgi, even catching SIGTERM may not help if more than 3 seconds of processing occurs (I don't know if this is clock time or cpu time). Again, that's because Apache turns around and SIGKILLs the cgi, and that signal can't be caught (remember, this is what bit us a while back w.r.t. Postfix filter programs).
So maybe the answer is to install a Python-level SIGTERM handler that will release the lock. To understand this, you need to know how Python handles signals. When you set a signal handler in Python, at the C level, Python installs its own signal handler, which simply sets a flag and returns. That's because Python will only run Python-level signal handlers in between bytecode boundaries. So your Python signal handler can get run at some arbitrary point in the future, and likely /not/ when the signal actually occurred.
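Here's a tiny standalone sketch (not Mailman code) that makes the deferred delivery visible:

```python
import signal
import sys
import time

# The C-level handler Python installs just sets a flag; this Python-level
# handler runs later, at a bytecode boundary, not the instant SIGTERM lands.
def handler(signum, frame):
    sys.stderr.write('SIGTERM finally handled at %f\n' % time.time())

signal.signal(signal.SIGTERM, handler)

# If the interpreter is stuck inside a long C call, the handler waits
# until that call returns and bytecode execution resumes.
while 1:
    time.sleep(5)   # kill -TERM this pid; the sleep is cut short
```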
All this is very dodgy in the face of try/finally clauses, which is how cgi's like admin.py ensure that the mailing list is saved and unlocked when an exception occurs. For example, my initial solution was to install a signal handler that raised an exception, thinking that this would trigger the finally clause and do the normal save & unlock. But because of race conditions and interactions with the Python interpreter loop, I often saw that while the exception was getting raised, the finally clause wasn't always getting executed. Guido confirmed that this isn't a safe approach.
So my next approach is to write a very minimal signal handler that only unlocks the list, and install this on SIGTERM. The trick is to make the signal handler, e.g. a nested function of main() in admin.py, and pass the MailList object into it using the default argument trick.
This seems to work, in that the locks appear to be cleared in the cases where they were left laying around before. But because of all the race conditions, I can't be 100% sure.
If you've read this far, the implication is that if the user hits the stop button, Mailman will in essence abort any changes to list configuration that this invocation may have made. Alternatively, we could try to save & unlock in the signal handler, but that raises the possibility of race conditions again. Also, it makes sense to move the save of the list data into the try: part of the clause and only do the unlocking in the finally. That way, the finally clause and the SIGTERM handler have the same semantics, and the list will get unlocked in the face of either an exception or a signal. But the list database will only get saved on successful completion of the task. I can live with those semantics (I think ;).
So the code in admin.py looks something like:
```python
def main():
    ...
    mlist = MailList.MailList(listname, lock=0)
    ...

    def sigterm_handler(signum, frame, mlist=mlist):
        mlist.Unlock()

    mlist.Lock()
    try:
        signal.signal(signal.SIGTERM, sigterm_handler)
        ...
        mlist.Save()
    finally:
        mlist.Unlock()
```
I think there are still opportunities to get nailed, but I think this works as well as is possible. In my testing I've been able to clear the locks when the stop button is pushed. In the rare event that the list's config.db file gets corrupted (i.e. the handler gets called in the middle of mlist.Save()), we've always got the config.db.last file to fall back on.
The other good news is that I think only admindb.py, admin.py, and confirm.py, need to be modified, since those are the only scripts that write to list attributes. I've been pretty diligent in Mailman 2.1 to only lock the mailing lists when necessary, so if the scripts are read-only, they don't acquire the lock (and get the state of the database as of the time of the initial Load()).
Phew!
-Barry
![](https://secure.gravatar.com/avatar/2206e8a0d58563f815a7568ea6675313.jpg?s=120&d=mm&r=g)
On 5/1/01 1:40 PM, "Barry A. Warsaw" <barry@digicool.com> wrote:
That would match something I've been seeing and sorta trying to debug (except it seems for me it only happens when I'm not watching, of course). And once someone hits stop and reloads, they hang waiting for a lock, and so do others -- and I can come in in the morning and see 50 broken locks for a list...
> So my next approach is to write a very minimal signal handler that only unlocks the list, and install this on SIGTERM.
That'd work, but I suggest an alternative, or a second item: when you try to set a lock and one is already set, see if the process that set the lock still exists (the info is available in the locks/ dir with a bit of poking). If that process is gone, delete the lock and move forward.
That way, both sides of the equation can fix the problem if needed.
As it should, IMHO. The only caveat, I think, is that you need to look through the code for places where breaking in the middle can leave you with incomplete or corrupted data, and protect those pieces from breakage, and handle the interrupt once you leave them.
If you can be sure that won't happen, great. But I'd make double-sure...
![](https://secure.gravatar.com/avatar/cb6b2a19d7ea20358a4c4f0332afc3ef.jpg?s=120&d=mm&r=g)
"CVR" == Chuq Von Rospach <chuqui@plaidworks.com> writes:
>> However, if I hit the stop button before the page is finished
>> loading, I can see that the CGI process continues to run for a
>> while and then it may or may not clear the locks.
CVR> That would match something I've been seeing and sorta trying
CVR> to debug (except it seems for me it only happens when I'm not
CVR> watching, of course).
Same here, the first 15 or so times I tried to recreate the bug. I had nearly as long a message already composed that contained a different evaluation and summarized as "works for me". Had to delete that and start over when I actually /did/ reproduce it. ;)
>> So my next approach is to write a very minimal signal handler
>> that only unlocks the list, and install this on SIGTERM.
CVR> That'd work, but I suggest an alternative, or a second item:
CVR> when you try to set a lock, and one is set, see if the
CVR> process that set the lock still exists (the info is available
CVR> in the locks/ dir with a bit of poking). If that process is
CVR> gone, delete the lock and move forward.
CVR> That way, both sides of the equation can fix the problem if
CVR> needed.
I've avoided that because of NFS issues; i.e., if you've got multiple Mailman installations sharing an NFS partition, the pids aren't relevant. The program can't know that, but the sysadmin can, so I'm inclined to instead write a script that will zap old locks if their processes don't exist. That way the site admin can run it as he sees appropriate for his installation.
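A rough cut of what such a script could look like, assuming a single-host installation, that locks live in a locks/ directory, and that each per-process lock file name ends in the owning pid (check those assumptions against your installation first):

```python
#! /usr/bin/env python
import os
import sys

LOCKDIR = '/home/mailman/locks'   # assumption: your $prefix/locks

def pid_alive(pid):
    try:
        os.kill(pid, 0)   # signal 0: existence check, delivers nothing
    except OSError:
        # EPERM would mean "alive but not ours"; in a dedicated Mailman
        # locks directory that shouldn't come up.
        return 0
    return 1

for name in os.listdir(LOCKDIR):
    try:
        pid = int(name.split('.')[-1])
    except ValueError:
        continue   # not a per-process lock file
    if not pid_alive(pid):
        sys.stderr.write('removing stale lock %s\n' % name)
        os.unlink(os.path.join(LOCKDIR, name))
```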
>> If you've read this far, the implication is that if the user
>> hits the stop button, Mailman will in essence abort any changes
>> to list configuration that this invocation may have made.
CVR> As it should, IMHO. The only caveat, I think, is that you
CVR> need to look through the code for places where breaking in
CVR> the middle can leave you with incomplete or corrupted data,
CVR> and protect those pieces from breakage, and handle the
CVR> interrupt once you leave them.
CVR> If you can be sure that won't happen, great. But I'd make
CVR> double-sure...
I think the only critical section is MailList.Save(), or more accurately, MailList.__save(). But even here I think you're as safe as possible, because Mailman writes new state using the following algorithm (sketched in code below the list):

1. open a config.db.tmp.hostname.pid file
2. write the new state to this temp file
3. unlink config.db.last
4. create a hard link config.db <-> config.db.last
5. atomically rename config.db.tmp.hostname.pid to config.db
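In Python, that dance looks roughly like this (helper names are illustrative, not the actual MailList.__save() code):

```python
import os
import socket

def save_state(dbfile, write_state):
    # write_state is any callable that serializes the new list state
    # to an open file object.
    tmp = '%s.tmp.%s.%d' % (dbfile, socket.gethostname(), os.getpid())
    last = dbfile + '.last'
    fp = open(tmp, 'wb')
    write_state(fp)             # steps 1-2: new state goes to a temp file
    fp.close()
    try:
        os.unlink(last)         # step 3: drop the old backup
    except OSError:
        pass                    # no backup yet; fine
    try:
        os.link(dbfile, last)   # step 4: current state becomes the backup
    except OSError:
        pass                    # no current file yet; fine
    os.rename(tmp, dbfile)      # step 5: atomic cutover -- readers see
                                # the old state or the new, never half
```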
If you get the SIGTERM during any of those steps, I think you're still guaranteed to have a valid config.db or config.db.last file: in the event of config.db being MIA, Mailman automatically falls back to config.db.last (and if config.db.last is MIA, config.db should still be valid and in place). It's possible that the new state in the tmp file won't become current, but that's what I meant by the abort implication, and I think that's fine (and actually correct semantics -- I agree with you, Chuq). If you get the signal in the middle of writing the config.db.tmp file, then oh well, it's corrupt, but it'll never be made the current state.
You have to be careful but fast when you get that SIGTERM because three seconds later you're getting SIGKILLed and at that point, you're screwed. I think we're safe, at least for the config.db files. I need to make sure that other files like request.db are safe from corruption (I actually think this one might be vulnerable because it doesn't take the same precautions as with config.db).
-Barry
![](https://secure.gravatar.com/avatar/2206e8a0d58563f815a7568ea6675313.jpg?s=120&d=mm&r=g)
On 5/1/01 8:52 PM, "Barry A. Warsaw" <barry@digicool.com> wrote:
If you have that, don't you have chaos anyway? Is the create-and-link lock style reliable over NFS in the first place? Isn't putting locks from multiple machines in the same directory just a plain old bad idea?
I ran into a strange little problem today -- I'm using time() to generate a filename for a temporary directory. Works great, until you start running multiple processes on a 2-CPU machine. I started having two processes get the same time() value (which is impossible on a single-CPU system) and fight over the same directory. I'm now doing a random()-based sleep to get away from this.
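For what it's worth, folding the pid (and the hostname, if the directory is shared over NFS) into the name removes the collision without needing the sleep; a sketch:

```python
import os
import socket
import time

# Unique per process, not just per second: two processes can share a
# time() value, but never a (hostname, pid) pair at the same moment.
dirname = '/var/tmp/work.%d.%d.%s' % (
    int(time.time()), os.getpid(), socket.gethostname())
os.mkdir(dirname)   # atomic; failure here would mean a leftover dir
                    # from a recycled pid, worth a loud error
```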
It seems to me that sharing a single directory for locks over NFS is asking for the same kind of weirdie problems I got to track down today... NFS changes the paradigms enough, especially about atomic operations, that I'm worried you're asking for issues here. I'd either put locks on a local disk, or make sure each machine has its own non-shared directory. And if you do, the proc information will be relevant....
I really think the lock-setting code needs some form of "this is dead, break it" logic in it -- that solves 99% of the problem, really. I know you can't depend on flock() being available, so that the kernel manages locks, but perhaps the config code could test for it and use it if it exists, and fall back to the current system if it doesn't?
I've thought about that, also, but it seems like a duct tape solution to me.
Okay, but... what if we go away after this is created? What is in charge of cleaning up leftovers? Realize I'm stretching a point here -- but in an extreme case, if nothing cleans this stuff up, you have a two-pronged denial of service attack. One prong is when all of the .pid numbers have temp files created, so future attempts start failing; the other is when you have enough of the tmp files that the disk fills up... Either long-term neglect or a motivated dinker could shut a list server down....
Have you considered forking and detaching for the write? At that point, you could daemonize a sub-process to do the actual DB update, and the parent handles talking to the user, so if it's aborted, the writer won't be killed. At some point, you pass a go/no-go point, and if it's go, you can safely detach from the user and isolate yourself so you know you'll finish....
![](https://secure.gravatar.com/avatar/24b55bf954fdd812040daa1db3ff9336.jpg?s=120&d=mm&r=g)
On Tuesday 01 May 2001 13:40, you wrote: [.....snip.....]
Barry, wanted to thank you muchly for the lengthy description of the problem and the patch that you provided. I figured that this was probably what was happening, after having gone through the process of running the CGIs repeatedly myself here.
As for the semantics of "save the list only if everything was successful", I too believe that those are livable (and likely proper) semantics to live with.
Will let you know again if we continue to have this problem, but from what I've seen so far this appears to have fixed the major fire that I've had. Now all I've got to figure out is how to try to speed up the admin CGIs so that they don't take two or three minutes to load when dealing with large lists...
Thanks again Barry,
-- Graham TerMarsch
![](https://secure.gravatar.com/avatar/cb6b2a19d7ea20358a4c4f0332afc3ef.jpg?s=120&d=mm&r=g)
"CVR" == Chuq Von Rospach <chuqui@plaidworks.com> writes:
>> Barry, wanted to thank you muchly for the lengthy description
>> of the problem and the patch that you provided. I figured that
>> this was probably what was happening, after having gone through
>> the process of running the CGIs repeatedly myself here.
CVR> By the by, I hate to say this, but I think this thing
CVR> deserves a 2.0.5 subrelease....
I was thinking the same thing. :/ I'd like to get more feedback on how important/useful you site guys think it would be.
Chuq's vote is counted. :)
-Barry
![](https://secure.gravatar.com/avatar/b26a573943ef4f62e9f6c2015eeba9ac.jpg?s=120&d=mm&r=g)
----- Original Message ----- From: "Barry A. Warsaw" <barry@digicool.com>
Just a quick vote: your patch has solved our problems with the stale locks, too and we haven't seen any issues coming up for the last couple of days now. Many thanks, Barry!
I'm positive that moving the storage to MySQL/PgSQL/Oracle will speed up the db side; I guess this is something for the 3.0 release.
Gergo
![](https://secure.gravatar.com/avatar/5853d442fdc18068f320e70b7f040a17.jpg?s=120&d=mm&r=g)
I'd have to agree with this. Grabbing any 30 records out of 65k is an operation that shouldn't take more than about a second on any modern DB. It also seems like it would skirt around the need for some of these complicated locking tricks.
- H
--
Harold Paulson Sierra Web Design haroldp@sierraweb.com http://www.sierraweb.com VOICE: 775.833.9500 FAX: 810.314.1517
![](https://secure.gravatar.com/avatar/cb6b2a19d7ea20358a4c4f0332afc3ef.jpg?s=120&d=mm&r=g)
"HP" == Harold Paulson <haroldp@sierraweb.com> writes:
HP> I'd have to agree with this. Grabbing any 30 records out of
HP> 65k is an operation that shouldn't take more than about a
HP> second on any modern DB. It also seems like it would skirt
HP> around the need for some of these complicated locking tricks.
No question about it. It's a timing issue: I want to move as quickly as possible towards a 2.1 release which will be the first internationalized Mailman. Rewriting (and more importantly, testing, and writing the migration code!) the underlying db would in my estimation push development and release back way too far.
-Barry
![](https://secure.gravatar.com/avatar/cb6b2a19d7ea20358a4c4f0332afc3ef.jpg?s=120&d=mm&r=g)
"GS" == Gergo Soros <soros_gergo@yahoo.com> writes:
GS> Just a quick vote: your patch has solved our problems with the
GS> stale locks, too and we haven't seen any issues coming up for
GS> the last couple of days now. Many thanks, Barry!
Ah good. That at least gives me enough confidence to guinea pig the changes on the {python,zope}.org site. I'll seriously consider a 2.0.5 release that contains just this fix.
>> GT> major fire that I've had. Now all I've got to figure out is
>> GT> how to try to speed up the admin CGIs so that they don't take
>> GT> two or three minutes to load when dealing with large lists...
GS> I'm positive that moving the storage to MySQL/PgSQL/Oracle
GS> will speed up the db side, guess this is something for the 3.0
GS> release.
Me too, and yes, definitely a 3.0 thing.
-Barry
![](https://secure.gravatar.com/avatar/24b55bf954fdd812040daa1db3ff9336.jpg?s=120&d=mm&r=g)
On Thursday 03 May 2001 01:13, Barry A. Warsaw wrote:
Count my vote in; I'm all for smaller, quicker, more incremental releases (especially when they contain bugfixes). :)
-- Graham TerMarsch
![](https://secure.gravatar.com/avatar/cb6b2a19d7ea20358a4c4f0332afc3ef.jpg?s=120&d=mm&r=g)
"CVR" == Chuq Von Rospach <chuqui@plaidworks.com> writes:
CVR> By the by, I hate to say this, but I think this thing
CVR> deserves a 2.0.5 subrelease....
Oh, let me add (since it's 4am :), that I also would like to get more feedback on the success of the patch. One thing I don't want to do is have 2.0.5 make things worse!
-Barry
![](https://secure.gravatar.com/avatar/cb6b2a19d7ea20358a4c4f0332afc3ef.jpg?s=120&d=mm&r=g)
"GT" == Graham TerMarsch <mailman@howlingfrog.com> writes:
GT> Barry, wanted to thank you muchly for the lengthy description
GT> of the problem and the patch that you provided. I figured
GT> that this was probably what was happening, after having gone
GT> through the process of running the CGIs repeatedly myself
GT> here.
You're welcome, and thanks for the feedback. I've committed those changes to CVS so they'll be part of 2.1.
GT> As for the semantics of "save the list only if everything was
GT> successful", I too believe that those are livable (and likely
GT> proper) semantics to live with.
I think it's the only sane thing to do.
GT> Will let you know again if we continue to have this problem,
GT> but from what I've seen so far this appears to have fixed the
GT> major fire that I've had. Now all I've got to figure out is
GT> how to try to speed up the admin CGIs so that they don't take
GT> two or three minutes to load when dealing with large lists...
Check to see if it's disk access. Remember you've got to load the entire marshaled dict into memory for each web hit, and that's gotta be expensive. Things will get much better for 2.1 from the email side because there'll be some caching involved (there's a long running qrunner process now), but you'll still pay the penalty on the web side. Really fixing that will be a job for Mailman 3.0.
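Concretely, every hit pays something like this up front (the path is illustrative; the point is that the whole dict gets unmarshaled before any real work happens):

```python
import marshal
import sys
import time

t0 = time.time()
fp = open('/home/mailman/lists/mylist/config.db', 'rb')
state = marshal.load(fp)   # the entire list state, all members included
fp.close()
sys.stdout.write('loaded %d keys in %.2fs\n'
                 % (len(state), time.time() - t0))
```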
-Barry
![](https://secure.gravatar.com/avatar/24b55bf954fdd812040daa1db3ff9336.jpg?s=120&d=mm&r=g)
On Thursday 03 May 2001 01:08, Barry A. Warsaw wrote:
I don't think it's disk access, but I could be wrong. The box has 512MB of RAM, and with the volume of traffic we've got coming in to this machine, the Mailman database should be stuck in cache perpetually. I could be totally wrong here, but I'm inclined to believe that it's not disk I/O that's holding this one up.
-- Graham TerMarsch
![](https://secure.gravatar.com/avatar/cb6b2a19d7ea20358a4c4f0332afc3ef.jpg?s=120&d=mm&r=g)
"BAW" == Barry A Warsaw <barry@digicool.com> writes:
BAW> So the code in admin.py looks something like:
```python
def main():
    ...
    mlist = MailList.MailList(listname, lock=0)
    ...

    def sigterm_handler(signum, frame, mlist=mlist):
        mlist.Unlock()

    mlist.Lock()
    try:
        signal.signal(signal.SIGTERM, sigterm_handler)
        ...
        mlist.Save()
    finally:
        mlist.Unlock()
```
I think this code isn't quite right. I think to be totally safe, you want sigterm_handler() to sys.exit(0) after the call to mlist.Unlock(). Otherwise, depending on race conditions, after unlocking the list you could still enter Save(), which would fail because it would first try to refresh a lock you no longer own.
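So the revised sketch, same shape as before but with the exit added:

```python
import signal
import sys

def main():
    # listname comes from the CGI environment, as in the earlier sketch
    mlist = MailList.MailList(listname, lock=0)

    def sigterm_handler(signum, frame, mlist=mlist):
        mlist.Unlock()
        # Don't fall back into main(): Save() would try to refresh a
        # lock we no longer own. Get out right here.
        sys.exit(0)

    mlist.Lock()
    try:
        signal.signal(signal.SIGTERM, sigterm_handler)
        # ... process the request ...
        mlist.Save()
    finally:
        # Unlock() must tolerate a second call, since the SystemExit
        # raised by the handler lands here too.
        mlist.Unlock()
```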
I'll work up a proper patch to Mailman 2.0.4 and post it to SF for you to try. Or you could modify your patched version and test it in the meantime.
-Barry
![](https://secure.gravatar.com/avatar/cb6b2a19d7ea20358a4c4f0332afc3ef.jpg?s=120&d=mm&r=g)
"BAW" == Barry A Warsaw <barry@digicool.com> writes:
BAW> I think this code isn't quite right. I think to be totally
BAW> safe, you want sigterm_handler() to sys.exit(0) after the
BAW> call to mlist.Unlock(). Otherwise, depending on race
BAW> conditions, after unlocking the list you could still enter
BAW> Save(), which would fail because it would first try to
BAW> refresh a lock you no longer own.
BAW> I'll work up a proper patch to Mailman 2.0.4 and post it to
BAW> SF for you to try. Or you could modify your patched version
BAW> and test it in the meantime.
On second thought, I'm only going to post the patch when I do the 2.0.5 release (which will happen tomorrow). All the changes are in the CVS Release_2_0_1-branch maintenance branch so you can check them out there for a preview.
I plan on doing some more testing tomorrow before the release, but so far it looks pretty good.
-Barry
![](https://secure.gravatar.com/avatar/3744bbfcb703bf1b3e473fd93e0c013c.jpg?s=120&d=mm&r=g)
Graham TerMarsch wrote:
I have a perl script that runs every 15 minutes and checks for lock files. Any lock file older than 10 minutes gets forcibly removed (by the script). This way my admins don't have to wait till I come around to manually delete them. Keeping in mind that the default timeout for a webserver is generally 5 minutes, the 10-minute wait period works out great.
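The same job in Python would look something like this (the lock directory path is an assumption; point it at your $prefix/locks):

```python
#! /usr/bin/env python
import os
import time

LOCKDIR = '/home/mailman/locks'
MAXAGE = 10 * 60   # seconds: anything older than 10 minutes goes

now = time.time()
for name in os.listdir(LOCKDIR):
    path = os.path.join(LOCKDIR, name)
    try:
        if now - os.path.getmtime(path) > MAXAGE:
            os.unlink(path)
    except OSError:
        pass   # the lock vanished while we looked; its owner beat us to it
```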
AMK4
--
I haven't lost my mind; it's backed up on tape somewhere.
Ashley M. Kirchner <mailto:ashley@pcraft.com> . 303.442.6410 x130
SysAdmin / Websmith . 800.441.3873 x130
Photo Craft Laboratories, Inc. . eFax 248.671.0909
http://www.pcraft.com . 3550 Arapahoe Ave #6
Boulder, CO 80303, USA
participants (6)

- Ashley M. Kirchner
- barry@digicool.com
- Chuq Von Rospach
- Gergo Soros
- Graham TerMarsch
- Harold Paulson