[Twisted-Python] Many connections and TIME_WAIT
I've been prototyping a client that connects to thousands of servers and calls some method. It's not really important to me at this stage whether that's via xmlrpc, perspective broker, or something else.

What seems to happen on the client machine is that each network connection that gets opened and then closed goes into a TIME_WAIT state, and eventually there are so many connections in that state that it's impossible to create any more.

I'm keeping an eye on the output of:

    netstat -an | wc -l

Initially I've got 569 entries there. When I run my test client, that ramps up really quickly and peaks at about 2824. At that point, the client reports a callRemoteFailure:

    callRemoteFailure [Failure instance: Traceback (failure with no frames):
    <class 'twisted.internet.error.ConnectionLost'>: Connection to the other
    side was lost in a non-clean fashion: Connection lost.

Increasing the file descriptor limits doesn't seem to have any effect on this.

Is there an established Twisted-sanctioned canonical way to free up this resource? Or am I doing something wrong? I'm looking into tweaking SO_REUSEADDR and SO_LINGER - does that sound sane?

Just tapping the lazywebs to see if anyone's already seen this in the wild.

Thanks guys

Donal
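For concreteness, here is a minimal sketch of the kind of client described above, using Twisted's XML-RPC support. The host list, port, and method name are placeholders invented for illustration, not anything from the original post.

```python
from twisted.internet import defer, reactor
from twisted.web.xmlrpc import Proxy

# Hypothetical host list and method name, purely for illustration.
HOSTS = ["server%d.example.com" % i for i in range(1000)]

def call_one(host):
    # The proxy URL is passed as a byte string, which newer Twisted expects.
    url = ("http://%s:8080/RPC2" % host).encode("ascii")
    d = Proxy(url).callRemote("ping")
    d.addErrback(lambda f: ("error", host, f.getErrorMessage()))
    return d

def main():
    # Each callRemote opens a short-lived TCP connection; as described above,
    # the sockets left behind on the client machine end up in TIME_WAIT.
    # A real client would throttle these instead of firing them all at once.
    dl = defer.DeferredList([call_one(h) for h in HOSTS], consumeErrors=True)
    dl.addCallback(lambda _: reactor.stop())
    reactor.run()

if __name__ == "__main__":
    main()
```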
On 04:50 am, donal.mcmullan@gmail.com wrote:
> I've been prototyping a client that connects to thousands of servers and calls some method. It's not really important to me at this stage whether that's via xmlrpc, perspective broker, or something else.
> What seems to happen on the client machine is that each network connection that gets opened and then closed goes into a TIME_WAIT state, and eventually there are so many connections in that state that it's impossible to create any more.
Yep. That's what happens to a TCP connection when you close it.
> I'm keeping an eye on the output of netstat -an | wc -l. Initially I've got 569 entries there. When I run my test client, that ramps up really quickly and peaks at about 2824. At that point, the client reports a callRemoteFailure:
Presumably these numbers have something to do with how quickly you're opening and closing new connections. TIME_WAIT lasts for 2MSL (4 minutes) to ensure that a future connection doesn't receive data intended for a previous connection (clearly a bad thing).

However... 2824 is a pretty low number at which to run out of sockets. Perhaps you're running this software on Windows? I think Windows has a ridiculously small number of "client sockets" allocated by default. I seem to recall this being something you can change with a registry edit or something like that. Another option would be to switch to a POSIX platform instead. If you're *not* on Windows, then this is odd and perhaps bears further scrutiny.
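As an aside (not part of the original exchange), a quick way to see how many of those netstat entries are actually in TIME_WAIT, rather than counting every line, is to tally the state column. A small, Linux-only sketch reading /proc/net/tcp, where state code 06 means TIME_WAIT:

```python
from collections import Counter

# TCP state codes as they appear (hex) in /proc/net/tcp; IPv4 sockets only.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}

def count_states(path="/proc/net/tcp"):
    counts = Counter()
    with open(path) as f:
        next(f)  # skip the header line
        for line in f:
            state = line.split()[3]  # fourth column is the connection state
            counts[TCP_STATES.get(state, state)] += 1
    return counts

if __name__ == "__main__":
    for state, n in count_states().most_common():
        print(state, n)
```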
> callRemoteFailure [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion: Connection lost.
This isn't exactly how I'd expect it to fail, but I also don't know what "callRemoteFailure" is or where it comes from, so maybe that's not too surprising.
> Increasing the file descriptor limits doesn't seem to have any effect on this.
Quite so. The process has, after all, already closed these sockets. They no longer count towards the process's file descriptor limit (oh dear, I suppose you're not using Windows if you have a file descriptor limit to raise).
> Is there an established Twisted-sanctioned canonical way to free up this resource? Or am I doing something wrong? I'm looking into tweaking SO_REUSEADDR and SO_LINGER - does that sound sane?
> Just tapping the lazywebs to see if anyone's already seen this in the wild.
On most reasonably configured Linux machines, you shouldn't run into this problem until you're doing at least an order of magnitude more work. Many times, I have run clients that do many thousands of new connections per second, resulting in tens of thousands of TIME_WAIT sockets on the system with no problem. So, I'm not sure why you're running into this after only a few thousand. Jean-Paul
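On the SO_REUSEADDR / SO_LINGER question from the original post, which the reply above doesn't address directly: Twisted exposes the underlying socket through the transport, so one can experiment with an abortive close (SO_LINGER on, timeout zero), which resets the connection instead of closing it gracefully and therefore leaves no TIME_WAIT entry behind. The protocol name below is made up, and whether an abortive close is actually a good idea is a separate question; this is only a sketch of where the knob lives:

```python
import socket
import struct

from twisted.internet import protocol

class AbortiveCloseProtocol(protocol.Protocol):
    """Illustrative protocol (hypothetical name) that sets SO_LINGER with a
    zero timeout, so closing the connection sends an RST and the local
    socket skips the TIME_WAIT state."""

    def connectionMade(self):
        sock = self.transport.getHandle()   # the underlying socket object
        linger = struct.pack("ii", 1, 0)    # l_onoff=1, l_linger=0
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, linger)
```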
I've been meaning to update this for a while - it turned out to be caused by a bug in my code. X-o Sorry guys - & thanks for helping me work it out.