[Twisted-Python] Handling PBConnectionLost errors
Hello,

Twisted PB sometimes loses its connection to the server. In this case, a PBConnectionLost exception is raised on the client. It would be nice to implement a fail-safe(er) way of calling remote methods that would retry when necessary until the remote method has been called successfully and the result has been returned. Note that this is only necessary when the remote method call should be invoked exactly once on the server (i.e. for POST-like calls that change server state). In the case of GET-like requests, a simpler retry mechanism will do.

The motivation for this is based on my experience of using Twisted in an application I am developing. The network communications are all happening on a LAN. The good end of the network is connected directly to a 100Mbps switch at the server. Failures occur more frequently at the other end (my end) of the network, which is connected through a 10/100 hub that is connected to the main switch. I rigged up a quick test with a 1000-request sample size; failures ranged from 28/1000 on the good end of the network to 83/1000 on the bad end of the network. One request consists of a single remote method call through PB. A success was when I got the expected result; a failure was when I got a PBConnectionLost error.

The following is pseudo code that I came up with to mitigate the problem.

Simple request (GET - repeatedly call method until success or RETRY_LIMIT is reached)

Client flow:
    for x in range(RETRY_LIMIT)
        invoke remote method without unique call identifier
        if result is not PBConnectionLost
            break
    if result is PBConnectionLost
        raise server not responding error

Server flow:
    (nothing special, just plain PB)

Complex request (POST - server-side method is invoked exactly once)

Client flow:
    use simple retry method to get a unique call identifier from server
        (a timeout value is also sent along to tell the server how long
        to hold the results of this request)
    for x in range(RETRY_LIMIT)
        invoke remote method with identifier
        if return value is not PBConnectionLost
            break
    if result is PBConnectionLost
        raise server not responding error
    using simple retry method, tell server to discard unique call identifier

Server flow:
    receive request for unique call identifier
        create and store identifier with UNCALLED token
        schedule identifier to be discarded with timeout value supplied by client
        return identifier to client
    receive remote method invocation with unique call identifier
        branch on value stored with unique call identifier
        if UNCALLED
            update identifier with CALLED token
            invoke method
            while result is deferred
                get deferred result
            store COMPLETED token and result with unique call identifier
            if there is another invocation WAITING
                (this means the connection was lost)
                signal the WAITING request with the result
            else
                return result to client
        if CALLED
            store WAITING token with unique identifier
                (must not overwrite other call tokens)
            defer until COMPLETED
        if COMPLETED
            return result to client
        if unique call identifier does not exist
            raise error
    receive request to discard unique call identifier
        if identifier exists
            discard identifier, tokens, and result
        return True

I realize that implementing this would not eliminate network errors. It would simply reduce the likelihood of failed method calls due to dropped connections. If I have my math correct (I always struggle a bit with statistics), even a RETRY_LIMIT of 2 would reduce the probability of a lost connection to 0.6% at the worst (<0.1% on the good end of the network).

I have two questions:

1. Does something like this already exist?
2. Is this a totally stupid idea? (would it be better to improve our physical network than to try to band-aid the problem with something like this?)

~ Daniel
Is this such a stupid question that it doesn't even warrant a response? ~ Daniel

On Jul 20, 2007, at 11:52 AM, Daniel Miller wrote:
Hello,
Twisted PB sometimes loses its connection to the server. In this case, a PBConnectionLost exception is raised on the client. It would be nice to implement a fail-safe(er) way of calling remote methods that would retry when necessary until the remote method has been called successfully and the result has been returned. Note that this is only necessary when the remote method call should be invoked exactly once on the server (i.e. for POST-like calls that change server state). In the case of GET-like requests, a simpler retry mechanism will do.
The motivation for this is based on my experience of using Twisted in an application I am developing. The network communications are all happening on a LAN. The good end of the network is connected directly to a 100Mbps switch at the server. Failures occur more frequently at the other end (my end) of the network, which is connected through a 10/100 hub that is connected to the main switch. I rigged up a quick test with a 1000-request sample size; failures ranged from 28/1000 on the good end of the network to 83/1000 on the bad end of the network. One request consists of a single remote method call through PB. A success was when I got the expected result, a failure was when I got a PBConnectionLost error.
The following is pseudo code that I came up with to mitigate the problem.
Simple request (GET - repeatedly call method until success or RETRY_LIMIT is reached)

Client flow:
    for x in range(RETRY_LIMIT)
        invoke remote method without unique call identifier
        if result is not PBConnectionLost
            break
    if result is PBConnectionLost
        raise server not responding error

Server flow:
    (nothing special, just plain PB)
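For illustration, a minimal Twisted sketch of this simple retry loop might look like the following. The names getRoot and RETRY_LIMIT are placeholders (getRoot is assumed to be any callable returning a deferred that fires with a PB root reference, e.g. PBClientFactory.getRootObject); this is only suitable for GET-like calls that can safely run more than once.

from twisted.internet import defer
from twisted.spread import pb

RETRY_LIMIT = 3

def retryingCall(getRoot, method, *args, **kw):
    """Call a remote method, retrying if the PB connection is lost."""
    attempts = [0]
    result = defer.Deferred()

    def attempt():
        d = getRoot()
        d.addCallback(lambda root: root.callRemote(method, *args, **kw))
        d.addCallbacks(result.callback, failed)

    def failed(reason):
        attempts[0] += 1
        if reason.check(pb.PBConnectionLost, pb.DeadReferenceError) \
                and attempts[0] < RETRY_LIMIT:
            attempt()                # connection dropped; try again
        else:
            result.errback(reason)   # retries exhausted, or a real error

    attempt()
    return result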
Complex request (POST - server-side method is invoked exactly once)

Client flow:
    use simple retry method to get a unique call identifier from server
        (a timeout value is also sent along to tell the server how long
        to hold the results of this request)
    for x in range(RETRY_LIMIT)
        invoke remote method with identifier
        if return value is not PBConnectionLost
            break
    if result is PBConnectionLost
        raise server not responding error
    using simple retry method, tell server to discard unique call identifier

Server flow:
    receive request for unique call identifier
        create and store identifier with UNCALLED token
        schedule identifier to be discarded with timeout value supplied by client
        return identifier to client
    receive remote method invocation with unique call identifier
        branch on value stored with unique call identifier
        if UNCALLED
            update identifier with CALLED token
            invoke method
            while result is deferred
                get deferred result
            store COMPLETED token and result with unique call identifier
            if there is another invocation WAITING
                (this means the connection was lost)
                signal the WAITING request with the result
            else
                return result to client
        if CALLED
            store WAITING token with unique identifier
                (must not overwrite other call tokens)
            defer until COMPLETED
        if COMPLETED
            return result to client
        if unique call identifier does not exist
            raise error
    receive request to discard unique call identifier
        if identifier exists
            discard identifier, tokens, and result
        return True
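To make the server flow concrete, here is a rough sketch of the bookkeeping described above as a pb.Root subclass. The method names (getCallId, post, discardCallId), the token values, and doPost are illustrative only; error handling for a failing doPost is left out.

import uuid
from twisted.internet import defer, reactor
from twisted.spread import pb

UNCALLED, CALLED, COMPLETED = "UNCALLED", "CALLED", "COMPLETED"

class ExactlyOnceRoot(pb.Root):
    def __init__(self):
        self.calls = {}    # callId -> [state, result, waiters]

    def remote_getCallId(self, timeout):
        callId = str(uuid.uuid4())
        self.calls[callId] = [UNCALLED, None, []]
        # discard the identifier (and any stored result) after the timeout
        reactor.callLater(timeout, self.calls.pop, callId, None)
        return callId

    def remote_post(self, callId, *args):
        entry = self.calls.get(callId)
        if entry is None:
            raise ValueError("unknown call identifier")
        state, stored, waiters = entry
        if state == COMPLETED:
            return stored                 # a retry after completion: replay
        if state == CALLED:
            d = defer.Deferred()          # a retry while the first call is
            waiters.append(d)             # still running: wait for its result
            return d
        entry[0] = CALLED
        d = defer.maybeDeferred(self.doPost, *args)
        def done(result):
            entry[0], entry[1] = COMPLETED, result
            for w in waiters:             # wake up any waiting retries
                w.callback(result)
            return result
        return d.addCallback(done)

    def remote_discardCallId(self, callId):
        self.calls.pop(callId, None)
        return True

    def doPost(self, *args):
        raise NotImplementedError("application-specific work goes here")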
I realize that implementing this would not eliminate network errors. It would simply reduce the likelihood of failed method calls due to dropped connections. If I have my math correct (I always struggle a bit with statistics), even a RETRY_LIMIT of 2 would reduce the probability of a lost connection to 0.6% at the worst (<0.1% on the good end of the network).
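As a sanity check on those figures, assuming the attempts fail independently, the measured rates work out as follows; the worst case comes to roughly 0.7% rather than 0.6%, and the good end is indeed well under 0.1%.

# probability that both of RETRY_LIMIT = 2 independent attempts fail
bad  = 83 / 1000.0   # measured failure rate, bad end of the network
good = 28 / 1000.0   # measured failure rate, good end
print(bad ** 2)      # ~0.0069, i.e. roughly 0.7%
print(good ** 2)     # ~0.0008, i.e. under 0.1%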
I have two questions:
1. Does something like this already exist?
2. Is this a totally stupid idea? (would it be better to improve our physical network than to try to band-aid the problem with something like this?)
Daniel Miller wrote:
I have two questions:
1. Does something like this already exist?
2. Is this a totally stupid idea? (would it be better to improve our physical network than to try to band-aid the problem with something like this?)
Daniel Miller wrote:
Is this such a stupid question that it doesn't even warrant a response?
I was waiting for somebody else to speak up before embarrassing myself, but anyway, here goes. ;-)

We had similar problems at a previous job of mine, and successfully used a similar approach to solve them; therefore you have my blessing, for what it's worth.

-- Nicola Larosa - http://www.tekNico.net/

Our criticisms of WS-* are specific and have to do with issues of process and stability and technical quality and a demonstrated lack of interoperability. It is badly-engineered technology, using it will increase the likelihood that your project fails, and it is not suitable for use by conscientious IT professionals. -- Tim Bray, February 2007
On Jul 25, 2007, at 10:59 AM, Nicola Larosa wrote:
Daniel Miller wrote:
I have two questions:
1. Does something like this already exist?
2. Is this a totally stupid idea? (would it be better to improve our physical network than to try to band-aid the problem with something like this?)
We had similar problems at a previous job of mine, and successfully used a similar approach to solve them; therefore you have my blessing, for what it's worth.
Thanks Nicola. ~ Daniel
On Wed, 2007-07-25 at 10:38 -0400, Daniel Miller wrote:
Is this such a stupid question that it doesn't even warrant a response?
Not really, but it's a more complex question than it first seems, and you're trying very hard to reproduce things that already exist.

First question: why is your PB server allocating "request IDs" and storing hashes of them? You could just return a pb.Referenceable to the client, backed by a normal python object (with a normal object lifecycle) on the server.

Regarding the retry mechanism: it looks to me like you're treating PB like an RPC mechanism, and finding that the nature of networks (they fail, unpredictably) is tripping you up. Try thinking of it in a more message-oriented way.

I would implement it something like this, using the python2.5 yield and inlineCallbacks functionality:

from twisted.web import server
from twisted.web.resource import Resource
from twisted.internet.defer import inlineCallbacks, returnValue

class MyResources(Resource):

    def __init__(self, pbclifactory):
        Resource.__init__(self)
        self.pbclifactory = pbclifactory   # a pb.PBClientFactory

    def render_POST(self, request):
        d = self.doOnlyOnce(request)
        d.addCallbacks(self.done, self.failed,
                       callbackArgs=(request,), errbackArgs=(request,))
        return server.NOT_DONE_YET

    def done(self, data, request):
        request.write(data)
        request.finish()

    def failed(self, f, request):
        request.write("An error occurred: ")
        request.write(f.getErrorMessage())
        request.finish()

    @inlineCallbacks
    def doOnlyOnce(self, request):
        pbroot = yield self.pbclifactory.getRootObject()
        theobject = yield pbroot.callRemote('makeRequest')
        data = yield theobject.callRemote('someMethod', request.args)
        # some local code here, then...
        # ok, done, tell the remote object to "commit" (finalise, delete)
        yield theobject.callRemote('commit')
        returnValue(data)

Your example is a bit theoretical, so it's difficult to see if this would work for you.
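That sketch implies a server-side shape that isn't shown in the message; assuming it, the PB side might look roughly like this. makeRequest, someMethod and commit are the names the client code above calls; everything else here is guesswork.

from twisted.spread import pb

class Request(pb.Referenceable):
    """One unit of work; it lives only as long as the client (and broker)
    hold a reference to it, so a dropped connection discards it."""

    def remote_someMethod(self, args):
        # do the actual work; a deferred may also be returned here
        return "result for %r" % (args,)

    def remote_commit(self):
        # the client is done with this request; clean up if needed
        return True

class Root(pb.Root):
    def remote_makeRequest(self):
        return Request()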
On Jul 25, 2007, at 11:09 AM, Phil Mayers wrote:
First question: why is your PB server allocating "request IDs" and storing hashes of them. You could just return a pb.Referenceable to the client, backed by a normal python object (with a normal object lifecycle) on the server.
That's something I hadn't thought of. A referenceable might simplify my problem a bit.
Regarding the retry mechanism: It looks to me like you're treating PB like an RPC mechanism, and finding that the nature of networks (they fail, unpredictably) is tripping you up. Try thinking of it in a more message-oriented way.
I would implement it something like this, using the python2.5 yield and inlineCallbacks functionality:
<snip>
Your example is a bit theoretical, so it's difficult to see if this would work for you.
I'll have to give that a deeper look later, but thanks a lot for the ideas and insight. ~ Daniel
Daniel Miller <daniel@keystonewood.com> writes:
Is this such a stupid question that it doesn't even warrant a response?
~ Daniel
I agree with the other comment to the effect that the lack of response may be more due to the underlying complexity of the question than to lack of interest. I know we definitely ran into similar issues in a large PB-based system I worked on a while ago, and in the end determined that we were best served by implementing our own system. For example, your opening point about:
(...) It would be nice to implement a fail-safe(er) way of calling remote methods that would retry when necessary until the remote method has been called successfully and the result has been returned. (...)
has an implicit assumption that the remote method will even continue to exist once the disconnect has occurred - something that is by no means guaranteed with PB. That is, what if the method you are trying to call is on a Referenceable you got back from the server, but it was to an object instance on the server that was created just for your client connection? The connection breaking will destroy that remote object and/or your ability to reconnect to it without special support on the server to keep it persistent. Not to mention however many other references to that remote object you may have in existence on the client, which will no longer function even after a reconnect.

That's not to say that there aren't plausible ways to achieve what you're looking for, but in general it becomes application specific, since you'll need knowledge as to how state management on your server is taking place, and what remote references are stable across connections. So if your use of random IDs and reconnect attempts is a workable way for you to manage the server state in such a way that it is reconnectable, then it may be perfectly good in your environment.

Perhaps some earlier messages of mine, from when we had just finished putting together the remote wrapping and reconnect support in our system, will be of interest. See my responses to the thread at:

http://twistedmatrix.com/pipermail/twisted-python/2005-July/011030.html

and

http://twistedmatrix.com/pipermail/twisted-python/2005-July/011046.html

It hits on topics beyond that of just a reliable method call, but the second message more specifically talks about the wrapper that implements reconnections, and how we dealt with updating references post-reconnect. You can probably see how the design dovetailed with our particular server side structure (the registry was persistent, as were the managers, so they provided the concrete point of reattachment). And the use of the wrappers around references meant we could "correct" the wrappers for a new connection without having to worry about what parts of the client application may have been holding references. Perhaps it will give you some other ideas in your own system.

For your other points:
I have two questions:
1. Does something like this already exist?
There used to be a "sturdy" PB module in Twisted (looks like it's gone in later releases) to attempt to provide a more persistent server reference. Also, if I recall correctly there's a ReconnectingClientFactory class somewhere which, while not PB specific, was a way to implement reconnections purely at the factory/protocol level. Of course, that's never really the complex part in a PB application - it's figuring out what to do with your remote references. Some of the work in the publish.py and refpath.py PB modules is also an attempt to solve some of the issues involved here. But I'm not aware of any existing approach that is generally suitable for any application. I rather doubt any single generic approach would be possible, since PB provides for many mechanisms of state management and referenceability among servers and clients.
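For reference, the usual recipe for combining the two is a factory that mixes ReconnectingClientFactory into PBClientFactory, roughly as sketched below. This is a community pattern rather than a class shipped with Twisted, and the exact PBClientFactory hooks should be checked against the version in use. Note that it only re-establishes the transport: remote references obtained over the old connection still die with it, which is exactly the application-specific problem discussed above.

from twisted.internet import protocol
from twisted.spread import pb

class ReconnectingPBClientFactory(pb.PBClientFactory,
                                  protocol.ReconnectingClientFactory):
    """PB client factory that keeps retrying the connection with back-off."""

    def clientConnectionMade(self, broker):
        self.resetDelay()      # a successful connection resets the back-off
        pb.PBClientFactory.clientConnectionMade(self, broker)

    def clientConnectionFailed(self, connector, reason):
        protocol.ReconnectingClientFactory.clientConnectionFailed(
            self, connector, reason)

    def clientConnectionLost(self, connector, reason):
        # reconnecting=True asks PBClientFactory not to fail pending
        # getRootObject() deferreds outright
        pb.PBClientFactory.clientConnectionLost(
            self, connector, reason, reconnecting=True)
        protocol.ReconnectingClientFactory.clientConnectionLost(
            self, connector, reason)

# usage, with hypothetical host/port:
#     factory = ReconnectingPBClientFactory()
#     reactor.connectTCP("server.example.com", 8800, factory)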
2. Is this a totally stupid idea? (would it be better to improve our physical network than to try to band-aid the problem with something like this?)
It's never a stupid idea to engineer for network interruptions, but like everything else a design must weigh benefits against cost/development.

With that said, it might not be a bad idea to also look into your network. TCP connections are rather hard to break just due to network transmission problems, and all your PB calls are going across a single TCP session. They might be significantly delayed on a bad network, but the connection itself shouldn't fail unless something more extreme (and unusual) is happening. Given the level of problems you're encountering, I wouldn't be surprised if something else was awry. Of course, that level of network troubleshooting can have its own cost/benefit analysis, and it might just be simpler to engineer around the problem at the application level as you are doing.

For example, our system above was used over a WAN, and we actually had several relays, each of which had their own wrappers for the next hop, so it was very important that while it might be down during an outage, it properly healed itself as soon as whichever segment had failed was reconnected. But we generally expected most outages to represent real network failures for a period of time (or a server going down), and less so a constant percentage of failing calls. (Not that the networks couldn't have packet loss, but network packet loss has to reach several percent before really impacting TCP to the point where we would notice.)

But another PB-based application I'm working on now is less crucial. Should I lose the server connection, I basically close down active UI windows that were working with previous references and notify the user that a disconnect has occurred. They can then initiate a new connection when they want.

-- David
David,

You have gone above and beyond my expectations to answer my questions. Thank you.

On Jul 28, 2007, at 1:07 AM, David Bolen wrote:
Daniel Miller <daniel@keystonewood.com> writes:
Is this such a stupid question that it doesn't even warrant a response?
~ Daniel
I agree with the other comment to the effect that the lack of response may be more due to the underlying complexity of the question than to lack of interest. ...
It's funny, my question was complex, but it nevertheless contained too many assumptions about my application and environment to allow you to answer easily. Thanks for taking a stab at it anyway.
For example, your opening point about:
(...) It would be nice to implement a fail-safe(er) way of calling remote methods that would retry when necessary until the remote method has been called successfully and the result has been returned. (...)
has an implicit assumption that the remote method will even continue to exist once the disconnect has occurred - something that is by no means guaranteed with PB.
I hadn't even thought of that, although now that you point it out it's obvious. My (server-side) application is just a singleton facade to an accounting system database. I'm posting orders from an order entry system to invoices in the accounting system. The server-supplied "referenceable" will always be available assuming something terrible has not happened to the server (e.g. crashed, hacked or physically damaged--none of which are things I'm trying to solve here).
Perhaps some earlier messages of mine, from when we had just finished putting together the remote wrapping and reconnect support in our system, will be of interest. See my responses to the thread at:
http://twistedmatrix.com/pipermail/twisted-python/2005-July/011030.html
and
http://twistedmatrix.com/pipermail/twisted-python/2005-July/011046.html
Thanks I'll take a look at them.
It hits on topics beyond that of just a reliable method call, but the second message more specifically talks about the wrapper that implements reconnections, and how we dealt with updating references post-reconnect. You can probably see how the design dovetailed with our particular server side structure (the registry was persistent as were the managers, so they provided the concrete point of reattachment). And the use of the wrappers around references meant we could "correct" the wrappers for a new connection without having to worry about what parts of the client application may have been holding references. Perhaps it will give you some other ideas in your own system.
This sounds good, I think I have a similar enough setup that I will be able to at least gain some good ideas.
For your other points:
I have two questions:
1. Does something like this already exist?
<snip>
... I'm not aware of any existing approach that is generally suitable for any application. I rather doubt any single generic approach would be possible, since PB provides for many mechanisms of state management and referenceability among servers and clients.
You're probably right, although the problem domain is interesting enough to me that I may try to see what I can do if I ever get enough time :)
2. Is this a totally stupid idea? (would it be better to improve our physical network than to try to band-aid the problem with something like this?)
It's never a stupid idea to engineer for network interruptions, but like everything else a design must weigh benefits against cost/development. With that said, it might not be a bad idea to also look into your network. TCP connections are rather hard to break just due to network transmission problems, and all your PB calls are going across a single TCP session. They might be significantly delayed on a bad network, but the connection itself shouldn't fail unless something more extreme (and unusual) is happening. Given the level of problems you're encountering, I wouldn't be surprised if something else was awry.
That's what I thought (the connections shouldn't just be dropping for no apparent reason, especially since they are all within the bounds of a LAN). I know this is getting off topic, but I thought maybe you'd know: collisions on the hub should be handled by TCP, and my application should not have to worry about them, correct? Even that doesn't answer why there are dropped connections on the switched side of the network. Maybe we have some bad wiring? FWIW, I am planning to eliminate the hub in favor of another switch (there are other problems as well).

Again, thanks very much for your well-thought-out response.

~ Daniel
David, Nicola and Phil,

Thanks very much for your feedback to my questions on this discussion thread. I have devised a working solution. Failed remote method calls due to dropped connections now seem to be a thing of the past (knock on wood...). I have attached a file that contains my "RecallClient" and "RecallServer" implementations as well as a very short example to show how they can be used (the usage examples are untested).

Note that this implementation is tailored to my specific needs, and therefore is definitely not a general solution. Here are some notable limitations (there may be more):

1. The client is tightly coupled to the pb.Root object.
2. "Posting" server methods cannot return data structures that contain deferreds (e.g. a list of deferreds). I'm not sure if that's even supported by PB anyway? I haven't tried it so I don't know.

However, maybe it will be useful to someone else? I'd be happy to hear your feedback if you decide to take a look at it.

~ Daniel
participants (4)
- Daniel Miller
- David Bolen
- Nicola Larosa
- Phil Mayers