[Twisted-Python] Anyone seeing hanging of Twisted HTTPS servers?

We've recently picked up 3 or 4 minute blocks on one of our servers. Wondering if anyone else has seen these. We have a "zombiechecker" script that downloads the front page of the web front-end every minute (restarting the entire server if it can't connect). After 20 or 30 minutes (sometimes a few hours) we see 3 or 4 of those checks queued, the connection has been accepted, but the server appears to be frozen (the reactor doesn't appear to be processing anything, neither web front-end or the processing back-end). After the 3 or 4 minutes frozen, the server starts processing the back-end operations and eventually gets around to replying to the web front-end requests. Most of the time we see an error in the logs like this just as the blockage is clearing: Traceback (most recent call last): File "/home/mcfletch/cvs/twisted/python/log.py", line 41, in callWithContext return context.call({ILogContext: newCtx}, func, *args, **kw) File "/home/mcfletch/cvs/twisted/python/context.py", line 52, in callWithContext return self.currentContext().callWithContext(ctx, func, *args, **kw) File "/home/mcfletch/cvs/twisted/python/context.py", line 31, in callWithContext return func(*args,**kw) File "/home/mcfletch/cvs/twisted/internet/selectreactor.py", line 139, in _doReadOrWrite why = getattr(selectable, method)() --- <exception caught here> --- File "/home/mcfletch/cvs/twisted/internet/tcp.py", line 111, in doRead return Connection.doRead(self) File "/home/mcfletch/cvs/twisted/internet/tcp.py", line 343, in doRead data = self.socket.recv(self.bufferSize) OpenSSL.SSL.Error: [('SSL routines', 'SSL23_READ', 'ssl handshake failure')] but the web front-end query still goes through (possibly because web.client is retrying?). Anyway, just wondering if anyone else has seen it or whether we're off in funky land. In the meantime I'm going to try turning on every bit of logging in the system while I'm off at a meeting and see if I can discover the precise point where it's hanging. Oh, FreeBSD 5.3-RC2, Python 2.3.4, Twisted 2.0, Nevow from CVS. Thanks all, Mike -- ________________________________________________ Mike C. Fletcher Designer, VR Plumber, Coder http://www.vrplumber.com http://blog.vrplumber.com

On Wed, May 18, 2005 at 09:31:44AM -0400, Mike C. Fletcher wrote: [...]
If you really want lots of logging, pass --spew to twistd :)
Oh, FreeBSD 5.3-RC2, Python 2.3.4, Twisted 2.0, Nevow from CVS.
Potentially relevant as well is the version of PyOpenSSL -- I don't know much about the SSL stuff in Twisted, but it's possible the bug lies there. -Andrew.

On 5/19/05, Charles Moad <cmoad@indiana.edu> wrote:
I am getting this same error (with the request going through fine) with version 0.6 of PyOpenSSL.
Pretty much all Twisted https servers will get this error very often. It's just spurious debugging info, afaict, and we have never had any kind of hanging in our web app at work. I'd bet the hanging is something else. I suggest running 'strace -p <pid>' (or whatever the FreeBSD equivalent is) on the process when you know that it's hanging, and see what it's doing, perhaps blocking on a syscall. -- Twisted | Christopher Armstrong: International Man of Twistery Radix | -- http://radix.twistedmatrix.com | Release Manager, Twisted Project \\\V/// | -- http://twistedmatrix.com |o O| | Founding Member, Hobart Hacking Society w----v----w-+ -- http://hackingsociety.org/chapters/hash

Christopher Armstrong wrote:
Okay, eliminates that from consideration. Thanks all for the reports. BTW, I'm also running 0.6 of PyOpenSSL. Regarding systrace, I'd love to, but so far I've not been able to catch the dratted thing in the condition more than once or twice, and it's normally almost finished the hang by the time I get around to working on it. Suppose I should just sit here watching it for a few hours, or rig up a monitor of some form. Thanks again, all, Mike -- ________________________________________________ Mike C. Fletcher Designer, VR Plumber, Coder http://www.vrplumber.com http://blog.vrplumber.com

On May 19, 2005, at 12:20 PM, Mike C. Fletcher wrote:
I think it's probably more likely that it's not hanging in C code at all, but in a bit of application python code. What I'd do is add a "watchdog" timer, something like this. Of course, this won't work if it's blocking in a C call somewhere because the python signal handler won't be called until the C call returns. But even that might give you some useful information on what the problem is. import signal, pdb, sys from twisted.internet import task def timeout(*args): sys.stderr.write("SIGALRM timeout, breaking into debugger.\n") import pdb; pdb.set_trace() signal.signal(signal.SIGALRM, timeout) task.LoopingCall(signal.alarm, 10).start(1) James

James Y Knight wrote: ...
That's really nice. However, it seems that the alarm handler drops into the database thread, rather than the foreground thread. Still, it does make it possible to simply log the heck out of everything and force an exit of the whole application to see where in the log it's stopping. Thanks, Mike -- ________________________________________________ Mike C. Fletcher Designer, VR Plumber, Coder http://www.vrplumber.com http://blog.vrplumber.com

I am getting this same error (with the request going through fine) with version 0.6 of PyOpenSSL.
Pretty much all Twisted https servers will get this error very often. It's just spurious debugging info, afaict,
In fact, any old garbage goes to any SSL protected port, you get this nasty traceback in the logs. Really has to go, patch included: --- trunk/twisted/internet/tcp.py 2005-05-06 23:17:03.000000000 +0200 +++ trunk-mod/twisted/internet/tcp.py 2005-05-21 21:29:10.000000000 +0200 @@ -125,7 +125,7 @@ log.err() return main.CONNECTION_LOST except SSL.Error: - log.err() +# log.err() return main.CONNECTION_LOST def doWrite(self): -- Nicola Larosa - nico@tekNico.net How can I in five minutes shift a lifetime's hidden pain? -- Julia Fordham, Girlfriend, Porcelain, 1989

On Wed, May 18, 2005 at 09:31:44AM -0400, Mike C. Fletcher wrote: [...]
If you really want lots of logging, pass --spew to twistd :)
Oh, FreeBSD 5.3-RC2, Python 2.3.4, Twisted 2.0, Nevow from CVS.
Potentially relevant as well is the version of PyOpenSSL -- I don't know much about the SSL stuff in Twisted, but it's possible the bug lies there. -Andrew.

On 5/19/05, Charles Moad <cmoad@indiana.edu> wrote:
I am getting this same error (with the request going through fine) with version 0.6 of PyOpenSSL.
Pretty much all Twisted https servers will get this error very often. It's just spurious debugging info, afaict, and we have never had any kind of hanging in our web app at work. I'd bet the hanging is something else. I suggest running 'strace -p <pid>' (or whatever the FreeBSD equivalent is) on the process when you know that it's hanging, and see what it's doing, perhaps blocking on a syscall. -- Twisted | Christopher Armstrong: International Man of Twistery Radix | -- http://radix.twistedmatrix.com | Release Manager, Twisted Project \\\V/// | -- http://twistedmatrix.com |o O| | Founding Member, Hobart Hacking Society w----v----w-+ -- http://hackingsociety.org/chapters/hash

Christopher Armstrong wrote:
Okay, eliminates that from consideration. Thanks all for the reports. BTW, I'm also running 0.6 of PyOpenSSL. Regarding systrace, I'd love to, but so far I've not been able to catch the dratted thing in the condition more than once or twice, and it's normally almost finished the hang by the time I get around to working on it. Suppose I should just sit here watching it for a few hours, or rig up a monitor of some form. Thanks again, all, Mike -- ________________________________________________ Mike C. Fletcher Designer, VR Plumber, Coder http://www.vrplumber.com http://blog.vrplumber.com

On May 19, 2005, at 12:20 PM, Mike C. Fletcher wrote:
I think it's probably more likely that it's not hanging in C code at all, but in a bit of application python code. What I'd do is add a "watchdog" timer, something like this. Of course, this won't work if it's blocking in a C call somewhere because the python signal handler won't be called until the C call returns. But even that might give you some useful information on what the problem is. import signal, pdb, sys from twisted.internet import task def timeout(*args): sys.stderr.write("SIGALRM timeout, breaking into debugger.\n") import pdb; pdb.set_trace() signal.signal(signal.SIGALRM, timeout) task.LoopingCall(signal.alarm, 10).start(1) James

James Y Knight wrote: ...
That's really nice. However, it seems that the alarm handler drops into the database thread, rather than the foreground thread. Still, it does make it possible to simply log the heck out of everything and force an exit of the whole application to see where in the log it's stopping. Thanks, Mike -- ________________________________________________ Mike C. Fletcher Designer, VR Plumber, Coder http://www.vrplumber.com http://blog.vrplumber.com

I am getting this same error (with the request going through fine) with version 0.6 of PyOpenSSL.
Pretty much all Twisted https servers will get this error very often. It's just spurious debugging info, afaict,
In fact, any old garbage goes to any SSL protected port, you get this nasty traceback in the logs. Really has to go, patch included: --- trunk/twisted/internet/tcp.py 2005-05-06 23:17:03.000000000 +0200 +++ trunk-mod/twisted/internet/tcp.py 2005-05-21 21:29:10.000000000 +0200 @@ -125,7 +125,7 @@ log.err() return main.CONNECTION_LOST except SSL.Error: - log.err() +# log.err() return main.CONNECTION_LOST def doWrite(self): -- Nicola Larosa - nico@tekNico.net How can I in five minutes shift a lifetime's hidden pain? -- Julia Fordham, Girlfriend, Porcelain, 1989
participants (6)
-
Andrew Bennetts
-
Charles Moad
-
Christopher Armstrong
-
James Y Knight
-
Mike C. Fletcher
-
Nicola Larosa