Twisted PB sometimes loses its connection to the server. In this case, a PBConnectionLost exception is raised on the client. It would be nice to implement a fail-safe(er) way of calling remote methods that would retry when necessary until the remote method has been called successfully and the result has been returned. Note that this is only necessary when the remote method call should be invoked exactly once on the server (i.e. for POST-like calls that change server state). In the case of GET-like requests, a simpler retry mechanism will do.
The motivation for this comes from my experience of using Twisted in an application I am developing. All network communication happens on a LAN. The good end of the network is connected directly to a 100Mbps switch at the server. Failures occur more frequently at the other end (my end) of the network, which is connected through a 10/100 hub that is in turn connected to the main switch. I rigged up a quick test with a 1000-request sample size; failures ranged from 28/1000 (2.8%) on the good end of the network to 83/1000 (8.3%) on the bad end. One request consists of a single remote method call through PB. A success was when I got the expected result; a failure was when I got a PBConnectionLost error.
The following is pseudocode that I came up with to mitigate the problem.
Simple request (GET - repeatedly call method until success or RETRY_LIMIT is reached)

Client flow:
    for x in range(RETRY_LIMIT)
        invoke remote method without unique call identifier
        if result is not PBConnectionLost
            break
    if result is PBConnectionLost
        raise server not responding error

Server flow:
    (nothing special, just plain PB)
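The simple retry loop above might look roughly like this in Python. This is only a sketch using the standard library: `ConnectionLost` is a stand-in for PBConnectionLost, and `call_with_retry`, `ServerNotResponding`, and the value of `RETRY_LIMIT` are names I made up for illustration. With real PB, `callRemote` returns a Deferred, so the loop would have to be expressed with errbacks rather than a plain `for` loop.

```python
class ConnectionLost(Exception):
    """Stand-in for PBConnectionLost (assumption, not the real class)."""


class ServerNotResponding(Exception):
    """Raised when every attempt ended in a lost connection."""


RETRY_LIMIT = 3  # hypothetical value


def call_with_retry(call, retry_limit=RETRY_LIMIT, lost=ConnectionLost):
    """Invoke call() until it succeeds or retry_limit attempts are used up.

    Only safe for GET-like calls, because the server-side method may run
    more than once.
    """
    for _attempt in range(retry_limit):
        try:
            return call()
        except lost:
            continue  # connection dropped, try again
    raise ServerNotResponding("no response after %d attempts" % retry_limit)
```

For example, `call_with_retry(lambda: fetch_status())` would call a hypothetical `fetch_status` up to three times before giving up.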
Complex request (POST - server-side method is invoked exactly once)

Client flow:
    use simple retry method to get a unique call identifier from server
        (a timeout value is also sent along to tell the server how long to hold the results of this request)
    for x in range(RETRY_LIMIT)
        invoke remote method with identifier
        if return value is not PBConnectionLost
            break
    if result is PBConnectionLost
        raise server not responding error
    using simple retry method, tell server to discard unique call identifier

Server flow:
    receive request for unique call identifier
        create and store identifier with UNCALLED token
        schedule identifier to be discarded with timeout value supplied by client
        return identifier to client
    receive remote method invocation with unique call identifier
        branch on value stored with unique call identifier
            if UNCALLED
                update identifier with CALLED token
                invoke method
                while result is deferred
                    get defer result
                store COMPLETED token and result with unique call identifier
                if there is another invocation WAITING
                    (this means the connection was lost)
                    signal the WAITING request with the result
                else
                    return result to client
            if CALLED
                store WAITING token with unique identifier (must not overwrite other call tokens)
                defer until COMPLETED
            if COMPLETED
                return result to client
            if unique call identifier does not exist
                raise error
    receive request to discard unique call identifier
        if identifier exists
            discard identifier, tokens, and result
        return True
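To make the server flow above more concrete, here is a rough, Twisted-free sketch of the token bookkeeping. Everything in it is hypothetical: the `CallRegistry` class, its method names, and the convention that the wrapped method reports its result through a `done` callback (standing in for a Deferred). It simplifies the pseudocode in two ways: every caller, including the first, is kept in a list of waiters instead of a separate WAITING token, and the timeout-based discard is left out.

```python
import itertools

UNCALLED, CALLED, COMPLETED = "UNCALLED", "CALLED", "COMPLETED"


class CallRegistry:
    """Tracks unique call identifiers so a method runs at most once per id."""

    def __init__(self):
        self._ids = itertools.count(1)
        self._calls = {}  # call_id -> {"state", "result", "waiters"}

    def new_id(self):
        call_id = next(self._ids)
        self._calls[call_id] = {"state": UNCALLED, "result": None, "waiters": []}
        return call_id

    def invoke(self, call_id, method, on_result):
        """Run method(done) once; deliver its result to every caller."""
        entry = self._calls[call_id]  # KeyError if the id does not exist
        if entry["state"] == UNCALLED:
            entry["state"] = CALLED
            entry["waiters"].append(on_result)
            method(lambda result: self._complete(call_id, result))
        elif entry["state"] == CALLED:
            # A retry arrived while the first invocation is still running.
            entry["waiters"].append(on_result)
        else:  # COMPLETED: hand back the stored result, do not re-invoke
            on_result(entry["result"])

    def _complete(self, call_id, result):
        entry = self._calls.get(call_id)
        if entry is None:
            return  # identifier was discarded before the method finished
        entry["state"] = COMPLETED
        entry["result"] = result
        waiters, entry["waiters"] = entry["waiters"], []
        for waiter in waiters:
            waiter(result)

    def discard(self, call_id):
        self._calls.pop(call_id, None)
        return True
```

A retried invocation with the same identifier either joins the waiters or gets the stored result, so the wrapped method itself is only ever started once.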
I realize that implementing this would not eliminate network errors. It would simply reduce the likelihood of failed method calls due to dropped connections. If I have my math correct (I always struggle a bit with statistics), and assuming each attempt fails independently of the others, even a RETRY_LIMIT of 2 would reduce the probability of a lost connection to about 0.7% at the worst (<0.1% on the good end of the network).
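The estimate can be checked with a few lines of Python, using the failure rates measured in my test. The function name `p_all_fail` is made up, and the calculation assumes each attempt fails independently, which may not hold if failures come in bursts.

```python
def p_all_fail(p_single, retry_limit):
    """Probability that every one of retry_limit independent attempts fails."""
    return p_single ** retry_limit


bad_end = p_all_fail(83 / 1000, 2)   # measured rate on the hub side
good_end = p_all_fail(28 / 1000, 2)  # measured rate on the switch side

print("bad end:  %.2f%%" % (bad_end * 100))   # 0.69%
print("good end: %.2f%%" % (good_end * 100))  # 0.08%
```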
I have two questions:
1. Does something like this already exist?
2. Is this a totally stupid idea? (Would it be better to improve our physical network than to try to band-aid the problem with something like this?)