[Twisted-Python] Timeout with pb callRemote

I just tracked down a bug in one of our servers that uses Twisted PB. The long and short of it was that the server made remote calls to clients that connected in, and in some cases those clients would fall off the network (disconnected network cable, etc.) but the server would not detect this. I tracked it down to TCP timeouts not telling Twisted quickly enough that the clients were offline.

What I was going to use to solve this was to put a timeout on the callRemote() by calling setTimeout on the Deferred returned when making the call. Then if the Deferred did not fire soon enough, I could treat this as dead-client detection and clean up its resources. The problem is that it looks like Deferred.setTimeout is deprecated. (see: http://twistedmatrix.com/trac/browser/tags/releases/twisted-9.0.0/twisted/in...)

Is there some other suitable way to set a timeout on a callRemote when using PB?

-Allen

Allen Bierbaum <abierbaum@gmail.com> writes:
Right - by default (sans enabling keepalives at the TCP level), TCP can only detect a problem when it is attempting to transmit data, or when it receives data from a system that has been restarted. That's by design, since it can't tell if the idle time is expected or not. So if your request to the client makes it through but the connection breaks before the server needs to send any further data (such as waiting for a response) the server - waiting to receive - can essentially remain in that state forever. Even with keepalives turned on at the TCP level, the total time to declare a failure with default timers is often in the 2+ hour range.
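(For completeness, you can at least turn keepalives on from Twisted itself - ITCPTransport provides setTcpKeepAlive(). A minimal sketch, with the protocol name being purely illustrative:

    from twisted.internet.protocol import Protocol

    class KeepAliveProtocol(Protocol):
        def connectionMade(self):
            # Ask the kernel to probe this connection when it goes idle.
            # The probe timers are OS defaults, so detecting a dead peer
            # can still take two hours or more unless they are tuned.
            self.transport.setTcpKeepAlive(True)

As noted above, though, the default timers make this a poor fit for rapid failure detection.)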
> Is there some other suitable way to set a timeout on a callRemote when using PB?
I'd probably suggest implementing some connection-monitoring mechanism in general for each client<->server connection, rather than trying to time out individual calls. The advantage of this is that it covers all sorts of failures in either direction and lets both sides fail any pending operations gracefully.

What we did in one of our larger PB systems was have our client object, after connecting, set up a periodic ping request to the server (first sketch below). Failure of that request (in addition to a network failure of other requests) would cause the client to disconnect (after generating an internal signal) and then fall into an automatic reconnection process. Since the ping is transmitting data over the session, failures will be detected much more rapidly (though still not instantaneously), once the TCP retransmit timers fail to deliver the data. We also had separate signaling and reconnect logic that allowed the client to reattach all of its existing remote object handles if it reconnected to a server that hadn't restarted (e.g., after just a network outage), but that's more complicated and not suitable for all types of remote object references.

While we didn't have requests originating from the server, you could have a mirror approach running on the server for each client, or you could just have a watchdog timer running on the server that disconnects a client if it hasn't heard a ping request from it in a given amount of time. On either side, explicitly disconnecting the connection will also cause any pending Deferreds for PB requests to fail and trigger their errbacks.

If you really wanted to implement a timeout for a specific request, you could still use a watchdog timer - start a callLater with the appropriate timeout, save the response Deferred, and cancel the callLater in the response's callback chain once it is received (second sketch below). What you should do if the callLater does fire is less clear. Personally I'd probably do something internal so any eventual response to the pending Deferred was ignored. You probably don't want to actually fire it yourself, since PB still references it and in theory could still get a response about it over the stream, which would try to double-fire the Deferred. That's part of why setTimeout on the Deferred itself can be a bad idea - someone else probably also references that Deferred and won't know it has already fired if the timeout expires. Disconnecting the client would work since, similar to the keepalive approach above, it fires the errback on all pending Deferreds over that session.
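Here's a minimal sketch of the client-side ping loop described above, using twisted.internet.task.LoopingCall. The remote_ping() method on the server root, the 30-second interval, and the class name are illustrative assumptions, not what our actual system used:

    from twisted.internet import task

    PING_INTERVAL = 30  # seconds between pings (illustrative)

    class PingingClient(object):
        """Pings the server periodically; disconnects on any ping failure."""

        def __init__(self, rootRef):
            self.rootRef = rootRef  # pb.RemoteReference to the server's root
            self.loop = task.LoopingCall(self.ping)

        def start(self):
            self.loop.start(PING_INTERVAL, now=False)

        def ping(self):
            # Assumes the server's pb.Root implements remote_ping().
            d = self.rootRef.callRemote("ping")
            d.addErrback(self.pingFailed)

        def pingFailed(self, failure):
            # Any failure means the session is suspect: stop pinging and
            # disconnect.  Disconnecting errbacks every pending PB Deferred
            # on this connection; reconnect logic can then take over.
            if self.loop.running:
                self.loop.stop()
            self.rootRef.broker.transport.loseConnection()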
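And a minimal sketch of the per-request watchdog. On timeout it disconnects the broker rather than firing the Deferred itself, so the failure arrives through PB's normal connection-lost path and nothing can double-fire. The function name and timeout policy are illustrative:

    from twisted.internet import reactor

    def callRemoteWithTimeout(ref, methodName, timeout, *args, **kw):
        """callRemote() on a pb.RemoteReference, disconnecting on timeout."""
        d = ref.callRemote(methodName, *args, **kw)

        # If no response arrives in time, drop the whole connection; PB
        # will then errback this Deferred (and any other pending ones).
        watchdog = reactor.callLater(
            timeout, ref.broker.transport.loseConnection)

        def cancelWatchdog(passthrough):
            # Response or error arrived in time; disarm the watchdog.
            if watchdog.active():
                watchdog.cancel()
            return passthrough

        d.addBoth(cancelWatchdog)
        return d

-- David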
