[Twisted-Python] How to debug an AMP connection?
![](https://secure.gravatar.com/avatar/e0114f22fcde3deed8ebe94c70652140.jpg?s=120&d=mm&r=g)
Hello, I'm having trouble with an AMP connection that doesn't fire the Deferred returned by callRemote. The AMP command copies files from the client (called a "worker" in the code below) to a server ("controller"). It sends a chunk of text at a time - 16K originally, but I tried smaller amounts too. The server appends the text to a queue to be written; it shouldn't take long, because it doesn't do the I/O synchronously, just starts a thread. What happens is that the client calls callRemote, but the Deferred that callRemote returns is never fired. I know the command reaches the server, because the client keeps track of how many bytes have been sent successfully, and the file on the server is bigger than that. I know that the Deferred doesn't fire, because I can examine it using a manhole and it has called = False. There are no error messages in the log. That's consistent with the Deferred not firing, because the error callback (self.next_or_retry) would log a message if the remote command failed. So, what can possibly be happening? And how can I find out? All suggestions most gratefully received. Peter. Client-side code: class ConnectionBuffer(object): """Class to store AMP commands and retry them after network outages. ... """ def send(self, adv_arg, *a, **kw): if self.protocol: # call_d is here purely to make the Deferred accessible when debugging. self.call_d = self.protocol.callRemote(*a, **kw).addBoth(self.next_or_retry, adv_arg) def connected(self, protocol): self.protocol = protocol self.send_next() def disconnected(self): self.protocol = None Server: class ControllerProtocol(amp.AMP): """Protocol used by the controller to communicate with a worker. ... """ def logoutput(self, product, branch, site, job, action, number, data, stream): """Receive stdout or stderr from an action running on a worker.""" actid = (product, branch, site, job, action) ac = self.factory.controller.actions.find(actid) ac.instances[number].logfile[stream].write(data) return {'status': True} commands.LogOutput.responder(logoutput)
![](https://secure.gravatar.com/avatar/d6304567ada7ac5e8c6f4e5902270831.jpg?s=120&d=mm&r=g)
On Fri, Jul 24, 2009 at 9:35 AM, Peter Westlake<peter.westlake@pobox.com> wrote:
"justs starts a thread"? Where are you starting a thread in the code example you've posted?
Please post a complete example; I don't see methods on client code below such as send_next() or next_or_retry(). It's hard to help without a complete example. In general, though, if you're trying to debug Deferreds you might try setting debugging with defer.setDebugging(1).
![](https://secure.gravatar.com/avatar/e0114f22fcde3deed8ebe94c70652140.jpg?s=120&d=mm&r=g)
On Fri, 24 Jul 2009 10:44 -0400, "Drew Smathers" <drew.smathers@gmail.com> wrote:
Sorry - I was trying to avoid clogging everyone's inboxes with too much code. So far I haven't been able to reproduce the bug in a simple way. In the meantime, here's the code that starts the thread. It's one of the very first bits of code I wrote with either Twisted or Python, so it may do things in quite the wrong way. It serialises writes to a file by storing requests in a queue, and it gets called by the line ac.instances[number].logfile[stream].write(data) in the command handler. from twisted.internet import defer, threads import os, errno class FileMan(object): def __init__(self, filename): self.filename = filename self.queue = [] self.started = False def check_file(self): self.started = os.path.exists(self.filename) return self.started def makedir(self): dirname = os.path.dirname(self.filename) if not os.path.isdir(dirname): os.makedirs(dirname) def add(self, op, *a, **kw): def getnext(result): del(self.queue[0]) if len(self.queue) > 0: self.queue[0].callback('unused') return result d = defer.Deferred() d.addCallback(op, *a, **kw) d.addBoth(getnext) self.queue.append(d) if len(self.queue) == 1: d.callback('unused') return d def write(self, data, mode='a'): def do_write(result, filename, data): def write_in_thread(filename, data): f = open(filename, mode) try: f.write(data) finally: f.close() self.started = True d = threads.deferToThread(write_in_thread, filename, data) return d return self.add(do_write, self.filename, data)
What kind of output does that produce? I did try it, but didn't see anything out of the ordinary. Peter.
![](https://secure.gravatar.com/avatar/7ed9784cbb1ba1ef75454034b3a8e6a1.jpg?s=120&d=mm&r=g)
On Fri, 24 Jul 2009 16:12:28 +0100, Peter Westlake <peter.westlake@pobox.com> wrote:
It makes Deferreds keep track of the call stack when they are created and when they are invoked. This can be helpful tracking down Deferred failures that don't otherwise identify themselves very well. If you're not seeing any failures being logged, then this probably won't help much, since there won't be anything to increase the verbosity of. :) Another thing to check out would be a network capture, to see if the server is actually sending back a response. If not, then you know you should look at the server code more carefully. If so, then the client code is probably at fault. Jean-Paul
![](https://secure.gravatar.com/avatar/e0114f22fcde3deed8ebe94c70652140.jpg?s=120&d=mm&r=g)
On Fri, 24 Jul 2009 12:01 -0400, "Jean-Paul Calderone" <exarkun@divmod.com> wrote:
On Fri, 24 Jul 2009 16:12:28 +0100, Peter Westlake <peter.westlake@pobox.com> wrote:
That would explain it :-)
I tried that, and was somewhat overwhelmed by the amount of data! But now I've found the python-pcapy package and I'm well on the way to having a script that can match AMP commands and responses and report any that don't match up. If it turns out to be useful, I could post it, as long as my employers don't mind. Thanks for the help, Peter.
![](https://secure.gravatar.com/avatar/e0114f22fcde3deed8ebe94c70652140.jpg?s=120&d=mm&r=g)
Background: I have a client program that calls callRemote, but the Deferred that callRemote returns is not fired. This is an intermittent error that only happens after some hours of traffic. By putting some logging into AMP, it's apparent that the server gets as far as sending the reply using BoxDispatcher._safeEmit. The original version of that ignores connection errors, but I overrode it with one that doesn't: trace_tags = False def _safeEmit(self, aBox): """ Emit a box, ignoring L{ProtocolSwitched} and L{ConnectionLost} errors which cannot be usefully handled. """ tag = None if amp.ANSWER in aBox: tag = aBox[amp.ANSWER] elif amp.ERROR in aBox: tag = aBox[amp.ERROR] if self.trace_tags and tag: log.debug('_safeEmit', tag) try: aBox._sendTo(self.boxSender) except (ProtocolSwitched, ConnectionLost): log.debug('ProtocolSwitched or ConnectionLost in _safeEmit', tag) The message in the last line is never seen, nor is there anything else in the logs on client or server to suggest that there has been any problem with the connection. On the client, BoxDispatcher._answerReceived logs all replies: def _answerReceived(self, box): """ An AMP box was received that answered a command previously sent with L{callRemote}. @param box: an AmpBox with a value for its L{ANSWER} key. """ b = box[ANSWER] question = self._outstandingRequests.pop(box[ANSWER]) question.addErrback(self.unhandledError) log.msg('_answerReceived', b, question) question.callback(box) log.msg('_answerReceived', b, 'returning') Likewise _errorReceived. I always see both log.msgs or neither, so it isn't something going wrong in the callback. The client sends a request. The server sends a reply with the same tag, and logs a message. Until it goes wrong, the client receives the reply and logs it. When it goes wrong, the client does not see the reply. The protocol has the unfired Deferred in _outstandingRequests, with the missing tag as key. All this suggests that the problem is either in the low-level network code, or somewhere in the network between the client and server. But doesn't TCP/IP tell you if a packet doesn't get through? I tried tracing the packets using a python-pycapy script (attached), but it showed more packets disappearing than actually were. That could be for several reasons, including the packet sniffer missing packets, and the _ask or _answer key not being the first thing in the box. It also made the problem go away to a large extent, though that may have been the "tail -F" I was running against the log. It does take a lot longer for the bug to manifest when tracing it. At this point, I'm stuck! If there isn't a solution to this, the only answer will be to time out and retry the command. But since TCP connections are meant to be reliable (or to give an error if they fail), I hope it won't come to that. Peter.
![](https://secure.gravatar.com/avatar/607cfd4a5b41fe6c886c978128b9c03e.jpg?s=120&d=mm&r=g)
On 5 Aug, 03:59 pm, peter.westlake@pobox.com wrote:
It doesn't really /tell/ you. But if there is some issue with the network that prevents packet delivery for long enough, then the connection breaks (giving you a connectionLost call in Twisted, which should errback your Deferred).
I tried tracing the packets using a python-pycapy script (attached), but it showed more packets disappearing than actually were.
It might be simpler to use something like wireshark, which already knows how to do TCP stream re-assembly and such. I know this isn't much of a suggestion, but it's the only thing that really comes to mind. Jean-Paul
![](https://secure.gravatar.com/avatar/e0114f22fcde3deed8ebe94c70652140.jpg?s=120&d=mm&r=g)
On Sat, 08 Aug 2009 16:32 +0000, exarkun@twistedmatrix.com wrote:
No, it's a good suggestion - the only problem is that it makes the problem happen far less often! What I've done is to capture the first few bytes of each packet, enough to show the whole of the AMP reply packets (which are just {'_answer':tag, 'status':True}) without any need for full-scale analysis and reassembly. I'm pleased to say that after many hours of watching tcpdump, the bug eventually happened again just minutes before I left for a long weekend. Sure enough, TCP/IP delivers the packet successfully. Tomorrow's first task is going to be putting tracing into the low-level code where the reactor is meant to receive the packet and call the AMP _outstandingRequests entry for it. It would have been almost impossible to get even this far without the Manhole feature. I've been using it all along to inspect the workings of the client code, and it was a big help on the server side too. The server handles a lot of connections, and I was able to set a debug flag in the protocol object for the connection I was tracing. There would have been very much more log data otherwise. It makes a big difference being able to see values of variables without having to add new log statements and spend another working day waiting for the bug. Whoever invented it - thank you! Peter.
![](https://secure.gravatar.com/avatar/d6304567ada7ac5e8c6f4e5902270831.jpg?s=120&d=mm&r=g)
On Fri, Jul 24, 2009 at 9:35 AM, Peter Westlake<peter.westlake@pobox.com> wrote:
"justs starts a thread"? Where are you starting a thread in the code example you've posted?
Please post a complete example; I don't see methods on client code below such as send_next() or next_or_retry(). It's hard to help without a complete example. In general, though, if you're trying to debug Deferreds you might try setting debugging with defer.setDebugging(1).
![](https://secure.gravatar.com/avatar/e0114f22fcde3deed8ebe94c70652140.jpg?s=120&d=mm&r=g)
On Fri, 24 Jul 2009 10:44 -0400, "Drew Smathers" <drew.smathers@gmail.com> wrote:
Sorry - I was trying to avoid clogging everyone's inboxes with too much code. So far I haven't been able to reproduce the bug in a simple way. In the meantime, here's the code that starts the thread. It's one of the very first bits of code I wrote with either Twisted or Python, so it may do things in quite the wrong way. It serialises writes to a file by storing requests in a queue, and it gets called by the line ac.instances[number].logfile[stream].write(data) in the command handler. from twisted.internet import defer, threads import os, errno class FileMan(object): def __init__(self, filename): self.filename = filename self.queue = [] self.started = False def check_file(self): self.started = os.path.exists(self.filename) return self.started def makedir(self): dirname = os.path.dirname(self.filename) if not os.path.isdir(dirname): os.makedirs(dirname) def add(self, op, *a, **kw): def getnext(result): del(self.queue[0]) if len(self.queue) > 0: self.queue[0].callback('unused') return result d = defer.Deferred() d.addCallback(op, *a, **kw) d.addBoth(getnext) self.queue.append(d) if len(self.queue) == 1: d.callback('unused') return d def write(self, data, mode='a'): def do_write(result, filename, data): def write_in_thread(filename, data): f = open(filename, mode) try: f.write(data) finally: f.close() self.started = True d = threads.deferToThread(write_in_thread, filename, data) return d return self.add(do_write, self.filename, data)
What kind of output does that produce? I did try it, but didn't see anything out of the ordinary. Peter.
![](https://secure.gravatar.com/avatar/7ed9784cbb1ba1ef75454034b3a8e6a1.jpg?s=120&d=mm&r=g)
On Fri, 24 Jul 2009 16:12:28 +0100, Peter Westlake <peter.westlake@pobox.com> wrote:
It makes Deferreds keep track of the call stack when they are created and when they are invoked. This can be helpful tracking down Deferred failures that don't otherwise identify themselves very well. If you're not seeing any failures being logged, then this probably won't help much, since there won't be anything to increase the verbosity of. :) Another thing to check out would be a network capture, to see if the server is actually sending back a response. If not, then you know you should look at the server code more carefully. If so, then the client code is probably at fault. Jean-Paul
![](https://secure.gravatar.com/avatar/e0114f22fcde3deed8ebe94c70652140.jpg?s=120&d=mm&r=g)
On Fri, 24 Jul 2009 12:01 -0400, "Jean-Paul Calderone" <exarkun@divmod.com> wrote:
On Fri, 24 Jul 2009 16:12:28 +0100, Peter Westlake <peter.westlake@pobox.com> wrote:
That would explain it :-)
I tried that, and was somewhat overwhelmed by the amount of data! But now I've found the python-pcapy package and I'm well on the way to having a script that can match AMP commands and responses and report any that don't match up. If it turns out to be useful, I could post it, as long as my employers don't mind. Thanks for the help, Peter.
![](https://secure.gravatar.com/avatar/e0114f22fcde3deed8ebe94c70652140.jpg?s=120&d=mm&r=g)
Background: I have a client program that calls callRemote, but the Deferred that callRemote returns is not fired. This is an intermittent error that only happens after some hours of traffic. By putting some logging into AMP, it's apparent that the server gets as far as sending the reply using BoxDispatcher._safeEmit. The original version of that ignores connection errors, but I overrode it with one that doesn't: trace_tags = False def _safeEmit(self, aBox): """ Emit a box, ignoring L{ProtocolSwitched} and L{ConnectionLost} errors which cannot be usefully handled. """ tag = None if amp.ANSWER in aBox: tag = aBox[amp.ANSWER] elif amp.ERROR in aBox: tag = aBox[amp.ERROR] if self.trace_tags and tag: log.debug('_safeEmit', tag) try: aBox._sendTo(self.boxSender) except (ProtocolSwitched, ConnectionLost): log.debug('ProtocolSwitched or ConnectionLost in _safeEmit', tag) The message in the last line is never seen, nor is there anything else in the logs on client or server to suggest that there has been any problem with the connection. On the client, BoxDispatcher._answerReceived logs all replies: def _answerReceived(self, box): """ An AMP box was received that answered a command previously sent with L{callRemote}. @param box: an AmpBox with a value for its L{ANSWER} key. """ b = box[ANSWER] question = self._outstandingRequests.pop(box[ANSWER]) question.addErrback(self.unhandledError) log.msg('_answerReceived', b, question) question.callback(box) log.msg('_answerReceived', b, 'returning') Likewise _errorReceived. I always see both log.msgs or neither, so it isn't something going wrong in the callback. The client sends a request. The server sends a reply with the same tag, and logs a message. Until it goes wrong, the client receives the reply and logs it. When it goes wrong, the client does not see the reply. The protocol has the unfired Deferred in _outstandingRequests, with the missing tag as key. All this suggests that the problem is either in the low-level network code, or somewhere in the network between the client and server. But doesn't TCP/IP tell you if a packet doesn't get through? I tried tracing the packets using a python-pycapy script (attached), but it showed more packets disappearing than actually were. That could be for several reasons, including the packet sniffer missing packets, and the _ask or _answer key not being the first thing in the box. It also made the problem go away to a large extent, though that may have been the "tail -F" I was running against the log. It does take a lot longer for the bug to manifest when tracing it. At this point, I'm stuck! If there isn't a solution to this, the only answer will be to time out and retry the command. But since TCP connections are meant to be reliable (or to give an error if they fail), I hope it won't come to that. Peter.
![](https://secure.gravatar.com/avatar/607cfd4a5b41fe6c886c978128b9c03e.jpg?s=120&d=mm&r=g)
On 5 Aug, 03:59 pm, peter.westlake@pobox.com wrote:
It doesn't really /tell/ you. But if there is some issue with the network that prevents packet delivery for long enough, then the connection breaks (giving you a connectionLost call in Twisted, which should errback your Deferred).
I tried tracing the packets using a python-pycapy script (attached), but it showed more packets disappearing than actually were.
It might be simpler to use something like wireshark, which already knows how to do TCP stream re-assembly and such. I know this isn't much of a suggestion, but it's the only thing that really comes to mind. Jean-Paul
![](https://secure.gravatar.com/avatar/e0114f22fcde3deed8ebe94c70652140.jpg?s=120&d=mm&r=g)
On Sat, 08 Aug 2009 16:32 +0000, exarkun@twistedmatrix.com wrote:
No, it's a good suggestion - the only problem is that it makes the problem happen far less often! What I've done is to capture the first few bytes of each packet, enough to show the whole of the AMP reply packets (which are just {'_answer':tag, 'status':True}) without any need for full-scale analysis and reassembly. I'm pleased to say that after many hours of watching tcpdump, the bug eventually happened again just minutes before I left for a long weekend. Sure enough, TCP/IP delivers the packet successfully. Tomorrow's first task is going to be putting tracing into the low-level code where the reactor is meant to receive the packet and call the AMP _outstandingRequests entry for it. It would have been almost impossible to get even this far without the Manhole feature. I've been using it all along to inspect the workings of the client code, and it was a big help on the server side too. The server handles a lot of connections, and I was able to set a debug flag in the protocol object for the connection I was tracing. There would have been very much more log data otherwise. It makes a big difference being able to see values of variables without having to add new log statements and spend another working day waiting for the bug. Whoever invented it - thank you! Peter.
participants (4)
-
Drew Smathers
-
exarkun@twistedmatrix.com
-
Jean-Paul Calderone
-
Peter Westlake