Re: [Twisted-Python] Handling PBConnectionLost errors

25 Jul 2007


      Is this such a stupid question that it doesn't even warrant a response?

~ Daniel


On Jul 20, 2007, at 11:52 AM, Daniel Miller wrote:
...
Hello,
Twisted PB sometimes loses its connection to the server. In this  
case, a PBConnectionLost exception is raised on the client. It  
would be nice to implement a fail-safe(er) way of calling remote  
methods that would retry when necessary until the remote method has  
been called successfully and the result has been returned. Note  
that this is only necessary when the remote method call should be  
invoked exactly once on the server (i.e. for POST-like calls that  
change server state). In the case of GET-like requests, a simpler  
retry mechanism will do.
The motivation for this is based on my experience of using Twisted  
in an application I am developing. The network communications are  
all happening on a LAN. The good end of the network is connected  
directly to a 100Mbps switch at the server. Failures occur more  
frequently at the other end (my end) of the network, which is  
connected through a 10/100 hub that is connected to the main  
switch. I rigged up a quick test with a 1000-request sample size;  
failures ranged from 28/1000 on the good end of the network to  
83/1000 on the bad end of the network. One request consists of a
single remote method call through PB. A request counted as a success
when I got the expected result and as a failure when I got a
PBConnectionLost error.
The following is pseudo code that I came up with to mitigate the  
problem.
Simple request (GET - repeatedly call method until success or  
RETRY_LIMIT is reached)
   Client flow:
      for x in range(RETRY_LIMIT)
         invoke remote method without unique call identifier
         if result is not PBConnectionLost
            break
      if result is PBConnectionLost
         raise server not responding error
   Server flow:
      (nothing special, just plain PB)
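
The simple (GET-like) client flow above can be sketched in plain Python. This is only an illustration: `ConnectionLost` is a local stand-in for `twisted.spread.pb.PBConnectionLost`, while `ServerNotResponding`, `call_with_retry`, `flaky_get`, and the `RETRY_LIMIT` value are hypothetical names chosen for the example. The remote call is modeled as an ordinary synchronous callable rather than a Deferred-returning `callRemote`.

```python
class ConnectionLost(Exception):
    """Stand-in for twisted.spread.pb.PBConnectionLost (assumption)."""


class ServerNotResponding(Exception):
    """Raised when every attempt ended in a lost connection."""


RETRY_LIMIT = 3  # illustrative value, not taken from the original post


def call_with_retry(remote_call, retry_limit=RETRY_LIMIT):
    """Call remote_call() until it succeeds or retry_limit attempts fail."""
    for _ in range(retry_limit):
        try:
            return remote_call()
        except ConnectionLost:
            continue  # safe only because the call is GET-like (no side effects)
    raise ServerNotResponding("no response after %d attempts" % retry_limit)


# Usage: a fake remote method that drops the connection on its first two calls.
attempts = []


def flaky_get():
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionLost()
    return "value"


result = call_with_retry(flaky_get)
```

Blind repetition like this is only reasonable because a GET-like call does not change server state, so running it more than once is harmless.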
Complex request (POST - server-side method is invoked exactly once)
   Client flow:
      use simple retry method to get a unique call identifier from server
         a timeout value is also sent along to tell the server how long to hold the results of this request
      for x in range(RETRY_LIMIT)
         invoke remote method with identifier
         if return value is not PBConnectionLost
            break
      if result is PBConnectionLost
         raise server not responding error
      using simple retry method tell server to discard unique call identifier
   Server flow:
      receive request for unique call identifier
         create and store identifier with UNCALLED token
         schedule identifier to be discarded with timeout value supplied by client
         return identifier to client
      receive remote method invocation with unique call identifier
         branch on value stored with unique call identifier
         if UNCALLED
            update identifier with CALLED token
            invoke method
            while result is deferred
               get defer result
            store COMPLETED token and result with unique call identifier
            if there is another invocation WAITING
               this means the connection was lost
               signal the WAITING request with the result
            else
               return result to client
         if CALLED
            store WAITING token with unique identifier (must not overwrite other call tokens)
            defer until COMPLETED
         if COMPLETED
            return result to client
         if unique call identifier does not exist
            raise error
      receive request to discard unique call identifier
         if identifier exists
            discard identifier, tokens, and result
         return True
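
The server flow above might look roughly like the following sketch. It uses the standard library's `asyncio` as a stand-in for Twisted's Deferreds, so it is not PB code. `CallRegistry`, `new_identifier`, `invoke`, `discard`, and `demo` are hypothetical names. The UNCALLED / CALLED / COMPLETED tokens are represented implicitly: no future stored yet, a pending future, and a finished future. A WAITING retry simply awaits the same future. The timeout-based discard from the pseudocode is left out for brevity.

```python
import asyncio
import uuid


class CallRegistry:
    """Server-side bookkeeping so each call identifier runs its method once."""

    def __init__(self):
        # identifier -> None (UNCALLED) or asyncio.Future (CALLED / COMPLETED)
        self._calls = {}

    def new_identifier(self):
        ident = uuid.uuid4().hex
        self._calls[ident] = None  # UNCALLED
        return ident

    async def invoke(self, ident, method, *args):
        if ident not in self._calls:
            raise KeyError("unknown call identifier: %s" % ident)
        future = self._calls[ident]
        if future is None:
            # UNCALLED -> CALLED: store the future before awaiting anything,
            # so a retry arriving meanwhile sees the call as in progress.
            future = asyncio.get_running_loop().create_future()
            self._calls[ident] = future
            try:
                future.set_result(await method(*args))  # COMPLETED
            except Exception as exc:
                future.set_exception(exc)
        # A pending future means "wait for the first call"; a finished one
        # returns (or re-raises) the stored outcome immediately.
        return await future

    def discard(self, ident):
        self._calls.pop(ident, None)
        return True


async def demo():
    registry = CallRegistry()
    counter = {"runs": 0}

    async def increment():  # a POST-like method that changes state
        counter["runs"] += 1
        await asyncio.sleep(0.01)  # simulate slow work
        return counter["runs"]

    ident = registry.new_identifier()
    # The original call and a retry arrive while the first is still running.
    first, retry = await asyncio.gather(
        registry.invoke(ident, increment),
        registry.invoke(ident, increment),
    )
    late = await registry.invoke(ident, increment)  # retry after completion
    registry.discard(ident)
    return first, retry, late, counter["runs"]


demo_result = asyncio.run(demo())
```

In this sketch all three invocations receive the same result and `increment` runs once. One caveat: it assumes the first invocation keeps running on the server even if its connection drops. If the handling task were cancelled instead, the stored future would never complete and later retries would wait forever.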
I realize that implementing this would not eliminate network  
errors. It would simply reduce the likelihood of failed method
calls due to dropped connections. If I have my math correct (I  
always struggle a bit with statistics), even a RETRY_LIMIT of 2  
would reduce the probability of a lost connection to about 0.7% at the
worst (<0.1% on the good end of the network).
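
The arithmetic behind that estimate can be checked with a few lines of Python. The sketch assumes that connection losses on successive attempts are independent, which may not hold on a flaky hub where failures tend to cluster. `all_attempts_fail` is a hypothetical helper name. The per-request failure rates are the measured 83/1000 and 28/1000 from the test described earlier.

```python
def all_attempts_fail(p_single, attempts):
    """Probability that every one of `attempts` independent tries fails."""
    return p_single ** attempts


bad_end = all_attempts_fail(83 / 1000, 2)   # worst measured failure rate
good_end = all_attempts_fail(28 / 1000, 2)  # best measured failure rate

print("bad end:  %.2f%%" % (bad_end * 100))   # prints "bad end:  0.69%"
print("good end: %.2f%%" % (good_end * 100))  # prints "good end: 0.08%"
```

Each additional retry multiplies the residual failure probability by the single-call failure rate again, so a RETRY_LIMIT of 3 would give roughly 0.06% on the bad end under the same assumption.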
I have two questions:
1. Does something like this already exist?
2. Is this a totally stupid idea? (would it be better to improve  
our physical network than to try to band-aid the problem with  
something like this?)