[Twisted-Python] help with ssl timeout and not reconnecting client factory
From a client I'm getting this error:
2005/03/16 10:42 CET [cpushare_protocol,client] 'limit sell' 2005/03/17 06:31 CET [cpushare_protocol,client] Traceback (most recent call last): File "/usr/lib64/python2.3/site-packages/twisted/python/log.py", line 65, in callWithLogger callWithContext({"system": lp}, func, *args, **kw) File "/usr/lib64/python2.3/site-packages/twisted/python/log.py", line 52, in callWithContext return context.call({ILogContext: newCtx}, func, *args, **kw) File "/usr/lib64/python2.3/site-packages/twisted/python/context.py", line 43, in callWithContext return func(*args,**kw) File "/usr/lib64/python2.3/site-packages/twisted/internet/pollreactor.py", line 160, in _doReadOrWrite why = selectable.doRead() --- <exception caught here> --- File "/usr/lib64/python2.3/site-packages/twisted/internet/tcp.py", line 98, in doRead return Connection.doRead(self) File "/usr/lib64/python2.3/site-packages/twisted/internet/tcp.py", line 239, in doRead data = self.socket.recv(self.bufferSize) OpenSSL.SSL.SysCallError: (110, 'Connection timed out') 2005/03/17 06:31 CET [cpushare_protocol,client] <twisted.internet.ssl.Connector instance at 0x2aaaac28d950> will retry in 2 s econds 2005/03/17 06:31 CET [cpushare_protocol,client] Stopping factory <cpushare.proto.cpushare_factory instance at 0x2aaaad414290> 2005/03/17 06:31 CET [-] Starting factory <cpushare.proto.cpushare_factory instance at 0x2aaaad414290> 2005/03/17 06:31 CET [-] Connection failed. Reason: [Failure instance: Traceback: twisted.internet.error.TimeoutError, User t imeout caused connection failure. 2005/03/17 06:31 CET [-] ] 2005/03/17 06:31 CET [-] Stopping factory <cpushare.proto.cpushare_factory instance at 0x2aaaad414290> This is a reconnecting client factory, the python version is 2.3.4 and twisted version is 1.3.0a. The socket should sit in a idle state. No communication over that socket will happen (the app is under development), but it should not go in timeout unless the connection with the server ends and the keepalive events triggers a disconnect (I enabled keepalive on the tcp level). Even if it goes in timeout it must try to reconnect immediatly while it seems like it's hanging after the "Stopping factory". Earlier when I got the connection timed out event (for no apparent good reason) at least it was immediatly trying to reconnect: 2005/03/14 05:30 CET [cpushare_protocol,client] 'limit sell' 2005/03/14 17:59 CET [cpushare_protocol,client] Traceback (most recent call last): File "/usr/lib64/python2.3/site-packages/twisted/python/log.py", line 65, in callWithLogger callWithContext({"system": lp}, func, *args, **kw) File "/usr/lib64/python2.3/site-packages/twisted/python/log.py", line 52, in callWithContext return context.call({ILogContext: newCtx}, func, *args, **kw) File "/usr/lib64/python2.3/site-packages/twisted/python/context.py", line 43, in callWithContext return func(*args,**kw) File "/usr/lib64/python2.3/site-packages/twisted/internet/pollreactor.py", line 160, in _doReadOrWrite why = selectable.doRead() --- <exception caught here> --- File "/usr/lib64/python2.3/site-packages/twisted/internet/tcp.py", line 98, in doRead return Connection.doRead(self) File "/usr/lib64/python2.3/site-packages/twisted/internet/tcp.py", line 239, in doRead data = self.socket.recv(self.bufferSize) OpenSSL.SSL.SysCallError: (110, 'Connection timed out') 2005/03/14 17:59 CET [cpushare_protocol,client] <twisted.internet.ssl.Connector instance at 0x2aaaac28d950> will retry in 2 s econds 2005/03/14 17:59 CET [cpushare_protocol,client] Stopping factory <cpushare.proto.cpushare_factory instance at 0x2aaaad414290> 2005/03/14 17:59 CET [-] Starting factory <cpushare.proto.cpushare_factory instance at 0x2aaaad414290> So my first priority is to understand why it stopped trying to reconnect (which is the major bug) and the second priority is to understand why it was going in timeout in the first place. (I can't exclude there have been a temporary network disruption that caused the keepalive to trigger the disconnect.) Could this be a bug in 1.3.0a? I expect the client will be mostly run with 1.3.0a, only on the server side I use SVN + pending fixes. This is the reconnecting code: class cpushare_factory(ReconnectingClientFactory): maxDelay = 600 # limit the maximum delay to 10 min protocol = cpushare_protocol def buildProtocol(self, addr): self.resetDelay() protocol = self.protocol() assert not hasattr(protocol, 'factory') protocol.factory = self return protocol def clientConnectionFailed(self, connector, reason): print 'Connection failed. Reason:', reason ReconnectingClientFactory.clientConnectionFailed(self, connector, reason) def connectionMade(self): self.transport.setTcpKeepAlive(1) Is the above correct? It works fine when the connection failed reason is "ConnectionRefusedError" instead of TimeoutError. What else should I do to prevent this error to leave the factory stopped? 2005/03/17 06:31 CET [-] Connection failed. Reason: [Failure instance: Traceback: twisted.internet.error.TimeoutError, User t imeout caused connection failure. Where does the "twisted.internet.error.TimeoutError" come from? The full source is LGPL and downloadable here: https://www.cpushare.com/downloads/cpushare-0.11.tar.bz2 The server side logs are absolutely not-interesting, all I get is a connectionLost event without any apparent error (server side is SVN-trunk). Any help is greatly appreciated, thanks!
On Thu, 17 Mar 2005 13:40:56 +0100, Andrea Arcangeli <andrea@cpushare.com> wrote:
From a client I'm getting this error:
[snip - traceback and log]
This is a reconnecting client factory, the python version is 2.3.4 and twisted version is 1.3.0a. The socket should sit in a idle state. No communication over that socket will happen (the app is under development), but it should not go in timeout unless the connection with the server ends and the keepalive events triggers a disconnect (I enabled keepalive on the tcp level).
Even if it goes in timeout it must try to reconnect immediatly while it seems like it's hanging after the "Stopping factory".
Earlier when I got the connection timed out event (for no apparent good reason) at least it was immediatly trying to reconnect:
[snip - traceback and log]
2005/03/14 17:59 CET [cpushare_protocol,client] <twisted.internet.ssl.Connector instance at 0x2aaaac28d950> will retry in 2 s econds 2005/03/14 17:59 CET [cpushare_protocol,client] Stopping factory <cpushare.proto.cpushare_factory instance at 0x2aaaad414290> 2005/03/14 17:59 CET [-] Starting factory <cpushare.proto.cpushare_factory instance at 0x2aaaad414290>
So my first priority is to understand why it stopped trying to reconnect (which is the major bug) and the second priority is to understand why it was going in timeout in the first place. (I can't exclude there have been a temporary network disruption that caused the keepalive to trigger the disconnect.)
For some reason unfathomable to me, ReconnectingClientFactory _stops_ trying to reconnect if a UserError is the cause of failed connection. Further, for some reason, error.TimeoutError subclasses UserError. This has bitten at least one other project (buildbot).
Could this be a bug in 1.3.0a? I expect the client will be mostly run with 1.3.0a, only on the server side I use SVN + pending fixes.
I'm inclined to say that it is indeed a bug. I think ReconnectingClientFactory should always retry the connection, regardless of the exception with which the previous attempt fails. If a program wants to allow a user to interrupt the retry logic, there is a "stopTrying" method.
This is the reconnecting code:
class cpushare_factory(ReconnectingClientFactory): maxDelay = 600 # limit the maximum delay to 10 min
protocol = cpushare_protocol
def buildProtocol(self, addr): self.resetDelay() protocol = self.protocol() assert not hasattr(protocol, 'factory') protocol.factory = self return protocol
def clientConnectionFailed(self, connector, reason): print 'Connection failed. Reason:', reason ReconnectingClientFactory.clientConnectionFailed(self, connector, reason)
If you look at twisted/internet/protocol.py for the definition of ReconnectingClientFactory.clientConnectionFailed, it should be pretty obvious how you want to redefine clientConnectionFailed to avoid the behavior you're seeing.
def connectionMade(self): self.transport.setTcpKeepAlive(1)
Is the above correct? It works fine when the connection failed reason is "ConnectionRefusedError" instead of TimeoutError.
What else should I do to prevent this error to leave the factory stopped?
2005/03/17 06:31 CET [-] Connection failed. Reason: [Failure instance: Traceback: twisted.internet.error.TimeoutError, User t imeout caused connection failure.
Where does the "twisted.internet.error.TimeoutError" come from?
It's generated internally by Twisted when the alloted connection time has elapsed without a connection being created. Most likely it _is_ network related problems that caused the connection to fail, but Twisted is certainly responsible for the decision to cease further reconnection attempts. Jp
On Thu, Mar 17, 2005 at 02:28:06PM +0000, Jp Calderone wrote:
For some reason unfathomable to me, ReconnectingClientFactory _stops_ trying to reconnect if a UserError is the cause of failed connection. Further, for some reason, error.TimeoutError subclasses UserError. This has bitten at least one other project (buildbot).
Ok great that you could explain it so quickly!
I'm inclined to say that it is indeed a bug. I think ReconnectingClientFactory should always retry the connection, regardless of the exception with which the previous attempt fails. If a program wants to allow a user to interrupt the retry logic, there is a "stopTrying" method.
Definitely agreed. I think at very least the error.TimeoutError should not stop the reconnection (dunno about the others subclassing UserError). For sure I'd rather prefer that it tries to reconnect no matter what.
If you look at twisted/internet/protocol.py for the definition of ReconnectingClientFactory.clientConnectionFailed, it should be pretty obvious how you want to redefine clientConnectionFailed to avoid the behavior you're seeing.
I'll look into a workaround.
It's generated internally by Twisted when the alloted connection time has elapsed without a connection being created.
Most likely it _is_ network related problems that caused the connection to fail, but Twisted is certainly responsible for the decision to cease further reconnection attempts.
Ok.
On Thu, 2005-03-17 at 13:40 +0100, Andrea Arcangeli wrote:
Even if it goes in timeout it must try to reconnect immediatly while it seems like it's hanging after the "Stopping factory".
There was a bug in 1.3 in ReconnectingClientFactory that caused it to not reconnect on TimeoutErrors. It's fixed in 2.0.
On Thu, Mar 17, 2005 at 11:51:04AM -0500, Itamar Shtull-Trauring wrote:
There was a bug in 1.3 in ReconnectingClientFactory that caused it to not reconnect on TimeoutErrors. It's fixed in 2.0.
Ok. This should workaround the bug in my app: Index: cpushare/proto.py =================================================================== RCS file: /home/andrea/crypto/cvs/cpushare/client/cpushare/cpushare/proto.py,v retrieving revision 1.29 diff -u -p -r1.29 proto.py --- cpushare/proto.py 13 Mar 2005 23:18:24 -0000 1.29 +++ cpushare/proto.py 17 Mar 2005 22:56:20 -0000 @@ -148,8 +148,24 @@ class cpushare_factory(ReconnectingClien protocol.factory = self return protocol + def clientConnectionLost(self, connector, reason): + print 'Lost connection. Reason:', reason + ReconnectingClientFactory.clientConnectionLost(self, connector, reason) + def clientConnectionFailed(self, connector, reason): print 'Connection failed. Reason:', reason + # workaround for bug in 1.3.0 - start + from twisted.copyright import version + if '1.3.0' in version: + def workaround_clientConnectionFailed(connector, reason): + if self.continueTrying: + self.connector = connector + from twisted.internet import interfaces, error, defer + if not reason.check(error.UserError) or reason.check(defer.TimeoutError): + self.retry() + workaround_clientConnectionFailed(connector, reason) + return + # workaround for bug in 1.3.0 - end ReconnectingClientFactory.clientConnectionFailed(self, connector, reason) def connectionMade(self): Thanks Itamar and Jp for the help!
On Thu, 17 Mar 2005 23:58:55 +0100, Andrea Arcangeli <andrea@cpushare.com> wrote:
On Thu, Mar 17, 2005 at 11:51:04AM -0500, Itamar Shtull-Trauring wrote:
There was a bug in 1.3 in ReconnectingClientFactory that caused it to not reconnect on TimeoutErrors. It's fixed in 2.0.
Ok. This should workaround the bug in my app:
Index: cpushare/proto.py =================================================================== RCS file: /home/andrea/crypto/cvs/cpushare/client/cpushare/cpushare/proto.py,v retrieving revision 1.29 diff -u -p -r1.29 proto.py --- cpushare/proto.py 13 Mar 2005 23:18:24 -0000 1.29 +++ cpushare/proto.py 17 Mar 2005 22:56:20 -0000 @@ -148,8 +148,24 @@ class cpushare_factory(ReconnectingClien protocol.factory = self return protocol
+ def clientConnectionLost(self, connector, reason): + print 'Lost connection. Reason:', reason + ReconnectingClientFactory.clientConnectionLost(self, connector, reason) + def clientConnectionFailed(self, connector, reason): print 'Connection failed. Reason:', reason + # workaround for bug in 1.3.0 - start + from twisted.copyright import version + if '1.3.0' in version: + def workaround_clientConnectionFailed(connector, reason): + if self.continueTrying: + self.connector = connector + from twisted.internet import interfaces, error, defer + if not reason.check(error.UserError) or reason.check(defer.TimeoutError): + self.retry() + workaround_clientConnectionFailed(connector, reason) + return + # workaround for bug in 1.3.0 - end ReconnectingClientFactory.clientConnectionFailed(self, connector, reason)
def connectionMade(self):
Thanks Itamar and Jp for the help!
Glad to offer it :) I wonder why you didn't go with the apparently equivalent but similar version of clientConnectionFailed, though: def clientConnectionFailed(self, connector, reason): print 'Connection failed. Reason:', reason # workaround for bug in 1.3.0 - start from twisted.copyright import version if '1.3.0' in version: if self.continueTrying: self.connector = connector self.retry() return # workaround for bug in 1.3.0 - end ReconnectingClientFactory.clientConnectionFailed( self, connector, reason) I guess I can see how defining a function for the logic further separates it from the rest of the code, but why define it inside the version test if suite? :) Maybe that was just the most convenient place at the moment.. The main difference, though, is the total absense of the if guarding the call to self.retry(). It seems like you duplicated the 1.3 logic in the 1.3 work-around code. Maybe I'm just misreading it, but it seems like you just want to omit that test and always retry(). Jp
participants (3)
-
Andrea Arcangeli -
Itamar Shtull-Trauring -
Jp Calderone