On 5 Aug, 03:59 pm, firstname.lastname@example.org wrote:
Background: I have a client program that calls callRemote, but the Deferred that callRemote returns is not fired. This is an intermittent error that only happens after some hours of traffic.
By putting some logging into AMP, it's apparent that the server gets as far as sending the reply using BoxDispatcher._safeEmit. The original version of that ignores connection errors, but I overrode it with one that doesn't:
On the client, BoxDispatcher._answerReceived logs all replies:
I always see both log.msgs or neither, so it isn't something going wrong in the callback.
The client sends a request. The server sends a reply with the same tag, and logs a message. Until it goes wrong, the client receives the reply and logs it. When it goes wrong, the client does not see the reply.
The protocol has the unfired Deferred in _outstandingRequests, with the missing tag as key.
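A rough sketch of how such stuck entries could be spotted while the program runs, assuming a hypothetical helper that records when each tag was sent. The class OutstandingTags and its methods are made up for illustration and are not part of AMP:

```python
import time


class OutstandingTags:
    """Hypothetical helper: remembers when each request tag was sent."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock      # injectable so the example can fake time
        self._sent_at = {}       # tag -> time the request was sent

    def sent(self, tag):
        # Call when a request with this tag goes out.
        self._sent_at[tag] = self._clock()

    def answered(self, tag):
        # Call when a reply (or error) with this tag comes back.
        self._sent_at.pop(tag, None)

    def stale(self, max_age):
        # Tags that have been waiting longer than max_age seconds.
        now = self._clock()
        return sorted(tag for tag, sent in self._sent_at.items()
                      if now - sent > max_age)


# Usage with a fake clock: tag "1a" is answered, tag "1b" never is.
fake_now = [0.0]
tracker = OutstandingTags(clock=lambda: fake_now[0])
tracker.sent("1a")
tracker.sent("1b")
fake_now[0] = 5.0
tracker.answered("1a")
fake_now[0] = 60.0
stale_tags = tracker.stale(max_age=30.0)
```

Logging the stale tags periodically would show when a reply first went missing, without waiting for someone to notice the hang.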
All this suggests that the problem is either in the low-level network code, or somewhere in the network between the client and server. But doesn't TCP/IP tell you if a packet doesn't get through?
It doesn't really /tell/ you. TCP retransmits unacknowledged data on its own, and if some issue with the network prevents delivery for long enough, the connection breaks (giving you a connectionLost call in Twisted, which should errback your Deferred).
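One caveat that may be relevant here: a side with nothing unacknowledged in flight (such as a client that is only waiting for a reply) never retransmits, so it can wait indefinitely on a dead connection unless TCP keepalive is enabled. A minimal sketch with the standard socket module, assuming a plain socket rather than your actual Twisted transport; the helper name enable_keepalive is made up for illustration:

```python
import socket


def enable_keepalive(sock):
    """Turn on TCP keepalive probes and report whether the option is set."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Some platforms report a non-zero value other than 1 when enabled.
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0


# Demonstrate on an unconnected socket; no network traffic is needed.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    keepalive_on = enable_keepalive(sock)
finally:
    sock.close()
```

If I remember the interface correctly, Twisted's TCP transports expose the same switch as transport.setTcpKeepAlive(True). Keep in mind that the default probe interval is long (two hours on a stock Linux system), so this only catches long-lived hangs unless the system settings are tuned.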
I tried tracing the packets using a python-pcapy script (attached), but it reported more packets missing than had actually been lost.
It might be simpler to use something like Wireshark, which already knows how to do TCP stream reassembly and such.
I know this isn't much of a suggestion, but it's the only thing that really comes to mind.