[Twisted-Python] intermittent problem: not accepting new connections
I have had a twisted epoll server that was heavily used, such that it saturated CPU (100% shown by "top", about 5000 connections, intense message relaying). I am using twisted 2.5.0 that I patched for epoll bug. It was run on python 2.4.4 , 2.6.11 kernel on a single core xeon 3.0 GHz CPU. This server has been on for many months, and it has been rock-stable. A couple of days ago I migrated that server to a newer machine: same patched twisted 2.5.0, same python 2.4.4, newer 2.6.24 kernel and a quad core xeon L5420 CPU. CPU usage dropped from 100% to 30%, as expected, with the same rate of client connections. However the server now has the following intermittent problem: about twice a day, it stops accepting new connections for a short period of 5-10 minutes. telnet times out, I get this: root@serv2:/proc/net/netfilter# telnet localhost 5229 Trying 127.0.0.1... Existing connections are not cut, they server receives/delivers messages to/from them just fine. These short periods of not accepting connections do not correlate with increased CPU load or with the overall number of connections to the server. I have had a problem with the same symptoms before, when a server process run out of its quota of file descriptors. However, there were clear messages in the twisted log at that time, and upping the ulimits solved the problem. This time, there are no errors in ANY logs (twisted log. /var/log/messages, etc) I am out of ideas on what this could be, because my setup is exactly the same as I have been using in the last year, except for a faster CPU and a newer kernel? I suspect that there are some new uncaught accept() exceptions in internet/tcp.py in the part where it's looking for EMFILE, ENOBUFS, ENFILE, ENOMEM, ECONNABORTED errors.
On Wed, 10 Sep 2008 13:32:06 -0700, Alec Matusis <matusis@yahoo.com> wrote:
I have had a twisted epoll server that was heavily used, such that it saturated CPU (100% shown by "top", about 5000 connections, intense message relaying). I am using twisted 2.5.0 that I patched for epoll bug. It was run on python 2.4.4 , 2.6.11 kernel on a single core xeon 3.0 GHz CPU. This server has been on for many months, and it has been rock-stable.
A couple of days ago I migrated that server to a newer machine: same patched twisted 2.5.0, same python 2.4.4, newer 2.6.24 kernel and a quad core xeon L5420 CPU. CPU usage dropped from 100% to 30%, as expected, with the same rate of client connections.
However the server now has the following intermittent problem: about twice a day, it stops accepting new connections for a short period of 5-10 minutes.
telnet times out, I get this: root@serv2:/proc/net/netfilter# telnet localhost 5229
Trying 127.0.0.1...
Existing connections are not cut, they server receives/delivers messages to/from them just fine. These short periods of not accepting connections do not correlate with increased CPU load or with the overall number of connections to the server.
I have had a problem with the same symptoms before, when a server process run out of its quota of file descriptors. However, there were clear messages in the twisted log at that time, and upping the ulimits solved the problem. This time, there are no errors in ANY logs (twisted log. /var/log/messages, etc)
Do non-error log messages continue to appear in the Twisted log? ie, is it clear that the logging system is still working, or could it have failed in some way, obscuring any exception reports?
I am out of ideas on what this could be, because my setup is exactly the same as I have been using in the last year, except for a faster CPU and a newer kernel?
I suspect that there are some new uncaught accept() exceptions in internet/tcp.py in the part where it's looking for EMFILE, ENOBUFS, ENFILE, ENOMEM, ECONNABORTED errors.
Any new unhandled errno values should definitely result in an exception being logged (notice that the `raise´ which follows the checks for various errno values is inside a try/except which logs any exception). Jean-Paul
Do non-error log messages continue to appear in the Twisted log? ie, is it clear that the logging system is still working, or could it have failed in some way, obscuring any exception reports?
Yes, I print the results of garbage collection every 10 min like this: print "number of objects tracked by the garbage collector is:", len(gc.get_objects()) and they appear in the main twisted log. 2008/09/10 13:21 -0700 [-] number of objects tracked by the garbage collector is: 860021 2008/09/10 13:31 -0700 [-] number of objects tracked by the garbage collector is: 864316
Any new unhandled errno values should definitely result in an exception being logged (notice that the `raise´ which follows the checks for various errno values is inside a try/except which logs any exception).
I noticed that raise too... Could it then be EWOULDBLOCK, EAGAIN or EPERM? except socket.error, e: if e.args[0] in (EWOULDBLOCK, EAGAIN): self.numberAccepts = i break elif e.args[0] == EPERM: # Netfilter on Linux may have rejected the # connection, but we get told to try to accept() # anyway. continue I am not sure how to debug this problem- I have another twisted server of a different type on that machine, and while the problematic server stops accepting connections, the second one works just fine, so this is not a machine-wide issue. What could it be?
-----Original Message----- From: twisted-python-bounces@twistedmatrix.com [mailto:twisted-python- bounces@twistedmatrix.com] On Behalf Of Jean-Paul Calderone Sent: Wednesday, September 10, 2008 1:44 PM To: Twisted general discussion Subject: Re: [Twisted-Python] intermittent problem: not accepting new connections
I have had a twisted epoll server that was heavily used, such that it saturated CPU (100% shown by "top", about 5000 connections, intense message relaying). I am using twisted 2.5.0 that I patched for epoll bug. It was run on python 2.4.4 , 2.6.11 kernel on a single core xeon 3.0 GHz CPU. This server has been on for many months, and it has been rock- stable.
A couple of days ago I migrated that server to a newer machine: same
twisted 2.5.0, same python 2.4.4, newer 2.6.24 kernel and a quad core xeon L5420 CPU. CPU usage dropped from 100% to 30%, as expected, with the same rate of client connections.
However the server now has the following intermittent problem: about twice a day, it stops accepting new connections for a short period of 5-10 minutes.
telnet times out, I get this: root@serv2:/proc/net/netfilter# telnet localhost 5229
Trying 127.0.0.1...
Existing connections are not cut, they server receives/delivers messages to/from them just fine. These short periods of not accepting connections do not correlate with increased CPU load or with the overall number of connections to the server.
I have had a problem with the same symptoms before, when a server
On Wed, 10 Sep 2008 13:32:06 -0700, Alec Matusis <matusis@yahoo.com> wrote: patched process
run out of its quota of file descriptors. However, there were clear messages in the twisted log at that time, and upping the ulimits solved the problem. This time, there are no errors in ANY logs (twisted log. /var/log/messages, etc)
Do non-error log messages continue to appear in the Twisted log? ie, is it clear that the logging system is still working, or could it have failed in some way, obscuring any exception reports?
I am out of ideas on what this could be, because my setup is exactly
the
same as I have been using in the last year, except for a faster CPU and a newer kernel?
I suspect that there are some new uncaught accept() exceptions in internet/tcp.py in the part where it's looking for EMFILE, ENOBUFS, ENFILE, ENOMEM, ECONNABORTED errors.
Any new unhandled errno values should definitely result in an exception being logged (notice that the `raise´ which follows the checks for various errno values is inside a try/except which logs any exception).
Jean-Paul
_______________________________________________ Twisted-Python mailing list Twisted-Python@twistedmatrix.com http://twistedmatrix.com/cgi-bin/mailman/listinfo/twisted-python
On Wed, 10 Sep 2008 14:01:27 -0700, Alec Matusis <matusis@yahoo.com> wrote:
Do non-error log messages continue to appear in the Twisted log? ie, is it clear that the logging system is still working, or could it have failed in some way, obscuring any exception reports?
Yes, I print the results of garbage collection every 10 min like this:
print "number of objects tracked by the garbage collector is:", len(gc.get_objects())
and they appear in the main twisted log.
2008/09/10 13:21 -0700 [-] number of objects tracked by the garbage collector is: 860021 2008/09/10 13:31 -0700 [-] number of objects tracked by the garbage collector is: 864316
Any new unhandled errno values should definitely result in an exception being logged (notice that the `raise´ which follows the checks for various errno values is inside a try/except which logs any exception).
I noticed that raise too... Could it then be EWOULDBLOCK, EAGAIN or EPERM?
except socket.error, e: if e.args[0] in (EWOULDBLOCK, EAGAIN): self.numberAccepts = i break elif e.args[0] == EPERM: # Netfilter on Linux may have rejected the # connection, but we get told to try to accept() # anyway. continue
I am not sure how to debug this problem- I have another twisted server of a different type on that machine, and while the problematic server stops accepting connections, the second one works just fine, so this is not a machine-wide issue. What could it be?
You could try running strace against the server while it's not accepting connections. You can limit the output to just accept calls, too: strace -p <pid> -etrace=accept Maybe this will show you that accept calls are happening but failing in some way. If it produces no output, you can widen it and see if you can tell what the server is doing if it isn't calling accept(). Jean-Paul
You could try running strace against the server while it's not accepting connections.
I have one short question before I try strace on live: I have been playing with strace this morning on my dev, and I observe that for every second connection, there are two accept calls, one is success and one is a failure: 1st connection: accept(9, {sa_family=AF_INET, sin_port=htons(33375), sin_addr=inet_addr("10.14.0.2")}, [16]) = 22 (client disconnects) 2nd connection: accept(9, {sa_family=AF_INET, sin_port=htons(33519), sin_addr=inet_addr("10.14.0.2")}, [16]) = 22 accept(9, 0xbfaada10, [16]) = -1 EAGAIN (Resource temporarily unavailable) (client disconnects) 3d connection: accept(9, {sa_family=AF_INET, sin_port=htons(33375), sin_addr=inet_addr("10.14.0.2")}, [16]) = 22 (client disconnects) 4th connection: accept(9, {sa_family=AF_INET, sin_port=htons(33519), sin_addr=inet_addr("10.14.0.2")}, [16]) = 22 accept(9, 0xbfaada10, [16]) = -1 EAGAIN (Resource temporarily unavailable) .... and so on. What is the meaning of accept(9, 0xbfaada10, [16]) = -1 EAGAIN (Resource temporarily unavailable) that occurs every second connection?
-----Original Message----- From: twisted-python-bounces@twistedmatrix.com [mailto:twisted-python- bounces@twistedmatrix.com] On Behalf Of Jean-Paul Calderone Sent: Wednesday, September 10, 2008 2:47 PM To: Twisted general discussion Subject: RE: [Twisted-Python] intermittent problem: not accepting new connections
On Wed, 10 Sep 2008 14:01:27 -0700, Alec Matusis <matusis@yahoo.com> wrote:
Do non-error log messages continue to appear in the Twisted log? ie, is it clear that the logging system is still working, or could it have failed in some way, obscuring any exception reports?
Yes, I print the results of garbage collection every 10 min like this:
print "number of objects tracked by the garbage collector is:", len(gc.get_objects())
and they appear in the main twisted log.
2008/09/10 13:21 -0700 [-] number of objects tracked by the garbage
collector is: 860021 >2008/09/10 13:31 -0700 [-] number of objects tracked by the garbage collector is: 864316 > >> Any new unhandled errno values should definitely result in an exception >> being logged (notice that the `raise´ which follows the checks for >> various errno values is inside a try/except which logs any exception). > >I noticed that raise too... Could it then be EWOULDBLOCK, EAGAIN or EPERM? > > except socket.error, e: > if e.args[0] in (EWOULDBLOCK, EAGAIN): > self.numberAccepts = i > break > elif e.args[0] == EPERM: > # Netfilter on Linux may have rejected the > # connection, but we get told to try to accept() > # anyway. > continue > > >I am not sure how to debug this problem- I have another twisted server of a different type on that machine, and while the problematic server stops accepting connections, the second one works just fine, so this is not a machine-wide issue. >What could it be?
You could try running strace against the server while it's not accepting connections. You can limit the output to just accept calls, too:
strace -p <pid> -etrace=accept
Maybe this will show you that accept calls are happening but failing in some way. If it produces no output, you can widen it and see if you can tell what the server is doing if it isn't calling accept().
Jean-Paul
_______________________________________________ Twisted-Python mailing list Twisted-Python@twistedmatrix.com http://twistedmatrix.com/cgi-bin/mailman/listinfo/twisted-python
On Wed, 10 Sep 2008 15:04:40 -0700, Alec Matusis <matusis@yahoo.com> wrote:
You could try running strace against the server while it's not accepting connections.
I have one short question before I try strace on live:
I have been playing with strace this morning on my dev, and I observe that for every second connection, there are two accept calls, one is success and one is a failure:
I'm pretty sure this is a consequence of a heuristic optimization in the accept loop. When one connection is successfully accepted, a counter is incremented. Whenever it tries to accept connections, it will try up to the number of times indicated by the current value of the counter. When it fails to get a new connection, it drops the counter to the number of connections it did accept. So in a lot of cases, it's flipping back and forth between 1 and 2 attempts. The point of this is that if you're receiving lots of connection attempts, it'd be faster to accept many of them at once instead of only accepting one each time around the reactor loop. It's not clear the current implementation is the best way to implement this optimization, but at least the code is very old and hasn't changed in quite a while, so it's probably not actually causing any problems. Jean-Paul
You could try running strace against the server while it's not accepting connections.
It went into this regime again tonight, and I captured strace. A few remarks first: The outage was brief, it lasted about 3 min. It did not correspond to the peak number of connections. The application has several protocols (like manhole and various control ports), that are listening on different ports. When the server stopped accepting connections on the main port (where external internet clients connect), telnet connection to manhole and control ports are still possible. Also, I think when it is in this bad state, it does accept some external connections, maybe 50%. Anyway, here's the strace in the bad state: accept(9, {sa_family=AF_INET, sin_port=htons(1474), sin_addr=inet_addr("83.153.87.198")}, [13980580626831507472]) = 491 accept(13, {sa_family=AF_INET, sin_port=htons(40953), sin_addr=inet_addr("10.18.2.10")}, [17987095445330984976]) = 491 --- SIGCHLD (Child exited) @ 0 (0) --- accept(9, {sa_family=AF_INET, sin_port=htons(50679), sin_addr=inet_addr("24.166.18.207")}, [17853676306370134032]) = 491 accept(9, 0x7fffc3d5ca30, [16]) = -1 EAGAIN (Resource temporarily unavailable) accept(9, {sa_family=AF_INET, sin_port=htons(52873), sin_addr=inet_addr("67.87.61.193")}, [9929874236988456976]) = 493 accept(9, {sa_family=AF_INET, sin_port=htons(54305), sin_addr=inet_addr("200.114.45.212")}, [2437573306904215568]) = 491 accept(9, 0x7fffc3d5ca30, [16]) = -1 EAGAIN (Resource temporarily unavailable) accept(9, {sa_family=AF_INET, sin_port=htons(2246), sin_addr=inet_addr("76.11.66.253")}, [14269655427913351184]) = 491 accept(9, {sa_family=AF_INET, sin_port=htons(32824), sin_addr=inet_addr("75.62.105.81")}, [4071254071732862992]) = 491 accept(9, 0x7fffc3d5ca30, [16]) = -1 EAGAIN (Resource temporarily unavailable) accept(9, {sa_family=AF_INET, sin_port=htons(2262), sin_addr=inet_addr("76.11.66.253")}, [15422576932520198160]) = 491 accept(9, {sa_family=AF_INET, sin_port=htons(3316), sin_addr=inet_addr("24.127.249.90")}, [17585430653564878864]) = 491 accept(9, 0x7fffc3d5ca30, [16]) = -1 EAGAIN (Resource temporarily unavailable) accept(9, {sa_family=AF_INET, sin_port=htons(4466), sin_addr=inet_addr("72.241.229.233")}, [8219350803517800464]) = 491 accept(9, {sa_family=AF_INET, sin_port=htons(1651), sin_addr=inet_addr("67.82.141.97")}, [8288312172811911184]) = 491 accept(9, 0x7fffc3d5ca30, [16]) = -1 EAGAIN (Resource temporarily unavailable) accept(9, {sa_family=AF_INET, sin_port=htons(2669), sin_addr=inet_addr("24.94.231.135")}, [7857092508491186192]) = 398 accept(9, {sa_family=AF_INET, sin_port=htons(50252), sin_addr=inet_addr("97.119.52.164")}, [5531546250907746320]) = 398 accept(9, 0x7fffc3d5ca30, [16]) = -1 EAGAIN (Resource temporarily unavailable) accept(9, {sa_family=AF_INET, sin_port=htons(2675), sin_addr=inet_addr("70.4.222.71")}, [8289438072718753808]) = 398 accept(9, {sa_family=AF_INET, sin_port=htons(63671), sin_addr=inet_addr("69.22.255.174")}, [13256345511754989584]) = 491 accept(9, 0x7fffc3d5ca30, [16]) = -1 EAGAIN (Resource temporarily unavailable) accept(9, {sa_family=AF_INET, sin_port=htons(54520), sin_addr=inet_addr("92.130.148.171")}, [17929956025058721808]) = 491 accept(13, {sa_family=AF_INET, sin_port=htons(41056), sin_addr=inet_addr("10.18.2.10")}, [6962565032504721424]) = 491 accept(13, 0x7fffc3d5ca30, [16]) = -1 EAGAIN (Resource temporarily unavailable) accept(9, {sa_family=AF_INET, sin_port=htons(49296), sin_addr=inet_addr("66.214.238.42")}, [10430336745580003344]) = 161 accept(9, 0x7fffc3d5ca30, [16]) = -1 EAGAIN (Resource temporarily unavailable) --- SIGCHLD (Child exited) @ 0 (0) --- accept(13, {sa_family=AF_INET, sin_port=htons(41199), sin_addr=inet_addr("10.18.2.10")}, [17266800979928416272]) = 161 accept(9, {sa_family=AF_INET, sin_port=htons(49376), sin_addr=inet_addr("68.34.251.86")}, [16194944268614238224]) = 161 --- SIGCHLD (Child exited) @ 0 (0) --- accept(13, {sa_family=AF_INET, sin_port=htons(41497), sin_addr=inet_addr("10.18.2.10")}, [1847038805765259280]) = 161 accept(13, 0x7fffc3d5ca30, [16]) = -1 EAGAIN (Resource temporarily unavailable) --- SIGCHLD (Child exited) @ 0 (0) --- accept(9, {sa_family=AF_INET, sin_port=htons(50013), sin_addr=inet_addr("71.241.76.86")}, [6756243874575810576]) = 161 accept(9, 0x7fffc3d5ca30, [16]) = -1 EAGAIN (Resource temporarily unavailable) --- SIGCHLD (Child exited) @ 0 (0) --- --- SIGCHLD (Child exited) @ 0 (0) --- accept(9, {sa_family=AF_INET, sin_port=htons(2594), sin_addr=inet_addr("24.27.123.223")}, [2452772955646590992]) = 161 accept(9, {sa_family=AF_INET, sin_port=htons(1690), sin_addr=inet_addr("67.242.75.46")}, [11098558340291100688]) = 276 accept(9, 0x7fffc3d5ca30, [16]) = -1 EAGAIN (Resource temporarily unavailable) accept(13, {sa_family=AF_INET, sin_port=htons(41825), sin_addr=inet_addr("10.18.2.10")}, [7035467051472781328]) = 276 accept(13, {sa_family=AF_INET, sin_port=htons(41910), sin_addr=inet_addr("10.18.2.10")}, [13160362544696655888]) = 276 here's the strace in the normal state: accept(9, {sa_family=AF_INET, sin_port=htons(40442), sin_addr=inet_addr("74.140.139.210")}, [18058590089415491600]) = 2218 accept(9, 0x7fffc3d5ca30, [16]) = -1 EAGAIN (Resource temporarily unavailable) accept(9, {sa_family=AF_INET, sin_port=htons(58477), sin_addr=inet_addr("99.194.115.132")}, [7918454053414109200]) = 431 accept(13, {sa_family=AF_INET, sin_port=htons(50878), sin_addr=inet_addr("10.18.2.10")}, [13746674921184952336]) = 356 accept(9, {sa_family=AF_INET, sin_port=htons(1154), sin_addr=inet_addr("190.37.68.45")}, [9368613133427408912]) = 356 accept(9, 0x7fffc3d5ca30, [16]) = -1 EAGAIN (Resource temporarily unavailable) --- SIGCHLD (Child exited) @ 0 (0) --- accept(9, {sa_family=AF_INET, sin_port=htons(1212), sin_addr=inet_addr("125.60.240.211")}, [13547953587627229200]) = 356 accept(9, {sa_family=AF_INET, sin_port=htons(1158), sin_addr=inet_addr("190.37.68.45")}, [9656843509579120656]) = 2223 accept(9, {sa_family=AF_INET, sin_port=htons(52008), sin_addr=inet_addr("24.189.205.204")}, [2939443190379315216]) = 2224 accept(9, {sa_family=AF_INET, sin_port=htons(50808), sin_addr=inet_addr("72.155.68.155")}, [8702643338529996816]) = 2253 accept(9, {sa_family=AF_INET, sin_port=htons(2545), sin_addr=inet_addr("24.10.61.223")}, [17368413446520963088]) = 2263 accept(9, 0x7fffc3d5ca30, [16]) = -1 EAGAIN (Resource temporarily unavailable) accept(9, {sa_family=AF_INET, sin_port=htons(2547), sin_addr=inet_addr("24.10.61.223")}, [17512528634596818960]) = 356 accept(9, 0x7fffc3d5ca30, [16]) = -1 EAGAIN (Resource temporarily unavailable) accept(9, {sa_family=AF_INET, sin_port=htons(52012), sin_addr=inet_addr("24.189.205.204")}, [3227673566531026960]) = 180 accept(9, {sa_family=AF_INET, sin_port=htons(50809), sin_addr=inet_addr("72.155.68.155")}, [8774700932567924752]) = 2224 accept(9, 0x7fffc3d5ca30, [16]) = -1 EAGAIN (Resource temporarily unavailable) accept(9, {sa_family=AF_INET, sin_port=htons(1223), sin_addr=inet_addr("125.60.240.211")}, [14340587122044436496]) = 2253 accept(9, {sa_family=AF_INET, sin_port=htons(2462), sin_addr=inet_addr("206.74.170.35")}, [11387633141372944400]) = 2263 accept(9, 0x7fffc3d5ca30, [16]) = -1 EAGAIN (Resource temporarily unavailable) accept(9, {sa_family=AF_INET, sin_port=htons(2464), sin_addr=inet_addr("206.74.170.35")}, [11531748329448800272]) = 2263 accept(9, {sa_family=AF_INET, sin_port=htons(1645), sin_addr=inet_addr("61.145.215.253")}, [7855966608584343568]) = 2224 accept(9, 0x7fffc3d5ca30, [16]) = -1 EAGAIN (Resource temporarily unavailable) accept(9, {sa_family=AF_INET, sin_port=htons(4172), sin_addr=inet_addr("68.188.149.222")}, [5480880755099828240]) = 1283 accept(9, {sa_family=AF_INET, sin_port=htons(50931), sin_addr=inet_addr("70.129.57.218")}, [17565727405195132944]) = 1283 accept(9, 0x7fffc3d5ca30, [16]) = -1 EAGAIN (Resource temporarily unavailable) accept(9, {sa_family=AF_INET, sin_port=htons(4173), sin_addr=inet_addr("68.188.149.222")}, [5552938349137756176]) = 2224 accept(9, {sa_family=AF_INET, sin_port=htons(1122), sin_addr=inet_addr("76.127.204.185")}, [7062770124213714960]) = 1283 accept(9, 0x7fffc3d5ca30, [16]) = -1 EAGAIN (Resource temporarily unavailable) accept(13, {sa_family=AF_INET, sin_port=htons(51062), sin_addr=inet_addr("10.18.2.10")}, [8558809625430851600]) = 2292 accept(13, 0x7fffc3d5ca30, [16]) = -1 EAGAIN (Resource temporarily unavailable) accept(9, {sa_family=AF_INET, sin_port=htons(1646), sin_addr=inet_addr("61.145.215.253")}, [7928024202622271504]) = 2292 accept(9, {sa_family=AF_INET, sin_port=htons(50932), sin_addr=inet_addr("70.129.57.218")}, [17637784999233060880]) = 2310 accept(9, 0x7fffc3d5ca30, [16]) = -1 EAGAIN (Resource temporarily unavailable) accept(9, {sa_family=AF_INET, sin_port=htons(4001), sin_addr=inet_addr("99.13.103.221")}, [11605494773346992144]) = 507 accept(9, {sa_family=AF_INET, sin_port=htons(4004), sin_addr=inet_addr("99.13.103.221")}, [11821667555460775952]) = 507 accept(9, 0x7fffc3d5ca30, [16]) = -1 EAGAIN (Resource temporarily unavailable) accept(9, {sa_family=AF_INET, sin_port=htons(1131), sin_addr=inet_addr("76.127.204.185")}, [7711288470555066384]) = 1283 accept(9, {sa_family=AF_INET, sin_port=htons(2143), sin_addr=inet_addr("202.156.50.129")}, [6847723242006773776]) = 1894 accept(9, 0x7fffc3d5ca30, [16]) = -1 EAGAIN (Resource temporarily unavailable) accept(9, {sa_family=AF_INET, sin_port=htons(2146), sin_addr=inet_addr("202.156.50.129")}, [7063896024120557584]) = 1894 accept(9, {sa_family=AF_INET, sin_port=htons(1779), sin_addr=inet_addr("222.165.66.136")}, [17511684209666686992]) = 1278 accept(9, 0x7fffc3d5ca30, [16]) = -1 EAGAIN (Resource temporarily unavailable) accept(9, {sa_family=AF_INET, sin_port=htons(61674), sin_addr=inet_addr("99.230.169.52")}, [16929031007875629072]) = 1278 accept(9, {sa_family=AF_INET, sin_port=htons(61679), sin_addr=inet_addr("99.230.169.52")}, [17289318978065268752]) = 1278 accept(9, 0x7fffc3d5ca30, [16]) = -1 EAGAIN (Resource temporarily unavailable) accept(9, {sa_family=AF_INET, sin_port=htons(1780), sin_addr=inet_addr("222.165.66.136")}, [17583741803704614928]) = 1913 accept(9, {sa_family=AF_INET, sin_port=htons(52495), sin_addr=inet_addr("190.40.52.74")}, [1138566289384538128]) = 2310 accept(9, 0x7fffc3d5ca30, [16]) = -1 EAGAIN (Resource temporarily unavailable) accept(13, {sa_family=AF_INET, sin_port=htons(51181), sin_addr=inet_addr("10.18.2.10")}, [17133663315944275984]) = 2310 accept(9, {sa_family=AF_INET, sin_port=htons(60995), sin_addr=inet_addr("88.105.44.33")}, [4894849853588242448]) = 2310 accept(9, {sa_family=AF_INET, sin_port=htons(3636), sin_addr=inet_addr("71.64.15.79")}, [3750935548236136464]) = 2310 accept(9, 0x7fffc3d5ca30, [16]) = -1 EAGAIN (Resource temporarily unavailable) accept(13, {sa_family=AF_INET, sin_port=htons(51273), sin_addr=inet_addr("10.18.2.10")}, [5316499368700805136]) = 2334 accept(13, 0x7fffc3d5ca30, [16]) = -1 EAGAIN (Resource temporarily unavailable) accept(9, {sa_family=AF_INET, sin_port=htons(52498), sin_addr=inet_addr("190.40.52.74")}, [1354739071498321936]) = 2334 accept(9, {sa_family=AF_INET, sin_port=htons(1440), sin_addr=inet_addr("24.21.117.196")}, [11530622429541957648]) = 2334 accept(9, 0x7fffc3d5ca30, [16]) = -1 EAGAIN (Resource temporarily unavailable) accept(9, {sa_family=AF_INET, sin_port=htons(1442), sin_addr=inet_addr("24.21.117.196")}, [11674737617617813520]) = 2334 accept(9, {sa_family=AF_INET, sin_port=htons(60996), sin_addr=inet_addr("88.105.44.33")}, [4966907447626170384]) = 2336 accept(9, 0x7fffc3d5ca30, [16]) = -1 EAGAIN (Resource temporarily unavailable) accept(13, {sa_family=AF_INET, sin_port=htons(51345), sin_addr=inet_addr("10.18.2.10")}, [10504646139431616528]) = 2337 accept(13, {sa_family=AF_INET, sin_port=htons(51348), sin_addr=inet_addr("10.18.2.10")}, [10720818921545400336]) = 2337 The only noticeable difference is the frequency of --- SIGCHLD (Child exited) @ 0 (0) --- but it occurred in the normal state as well. The server sends emails via child subprocesses, so I suspect Child exited may be due to that. If that's the case, the strace output is identical, and I do not understand what's causing the problem at all.
-----Original Message----- From: twisted-python-bounces@twistedmatrix.com [mailto:twisted-python- bounces@twistedmatrix.com] On Behalf Of Jean-Paul Calderone Sent: Wednesday, September 10, 2008 2:47 PM To: Twisted general discussion Subject: RE: [Twisted-Python] intermittent problem: not accepting new connections
On Wed, 10 Sep 2008 14:01:27 -0700, Alec Matusis <matusis@yahoo.com> wrote:
Do non-error log messages continue to appear in the Twisted log? ie, is it clear that the logging system is still working, or could it have failed in some way, obscuring any exception reports?
Yes, I print the results of garbage collection every 10 min like this:
print "number of objects tracked by the garbage collector is:", len(gc.get_objects())
and they appear in the main twisted log.
2008/09/10 13:21 -0700 [-] number of objects tracked by the garbage
collector is: 860021 >2008/09/10 13:31 -0700 [-] number of objects tracked by the garbage collector is: 864316 > >> Any new unhandled errno values should definitely result in an exception >> being logged (notice that the `raise´ which follows the checks for >> various errno values is inside a try/except which logs any exception). > >I noticed that raise too... Could it then be EWOULDBLOCK, EAGAIN or EPERM? > > except socket.error, e: > if e.args[0] in (EWOULDBLOCK, EAGAIN): > self.numberAccepts = i > break > elif e.args[0] == EPERM: > # Netfilter on Linux may have rejected the > # connection, but we get told to try to accept() > # anyway. > continue > > >I am not sure how to debug this problem- I have another twisted server of a different type on that machine, and while the problematic server stops accepting connections, the second one works just fine, so this is not a machine-wide issue. >What could it be?
You could try running strace against the server while it's not accepting connections. You can limit the output to just accept calls, too:
strace -p <pid> -etrace=accept
Maybe this will show you that accept calls are happening but failing in some way. If it produces no output, you can widen it and see if you can tell what the server is doing if it isn't calling accept().
Jean-Paul
_______________________________________________ Twisted-Python mailing list Twisted-Python@twistedmatrix.com http://twistedmatrix.com/cgi-bin/mailman/listinfo/twisted-python
Hi Alec just wondering... How does netstat look like? We ran into problems resource wise with the number of open connections aka file descriptors per process and observed similar problems to yours, also by switching to a new server. How does ulimit -a compare between old and new machine? HTH, Werner Alec Matusis wrote:
You could try running strace against the server while it's not accepting connections.
It went into this regime again tonight, and I captured strace.
A few remarks first:
The outage was brief, it lasted about 3 min. It did not correspond to the peak number of connections. The application has several protocols (like manhole and various control ports), that are listening on different ports. When the server stopped accepting connections on the main port (where external internet clients connect), telnet connection to manhole and control ports are still possible. Also, I think when it is in this bad state, it does accept some external connections, maybe 50%.
Anyway, here's the strace in the bad state:
accept(9, {sa_family=AF_INET, sin_port=htons(1474), sin_addr=inet_addr("83.153.87.198")}, [13980580626831507472]) = 491 accept(13, {sa_family=AF_INET, sin_port=htons(40953), sin_addr=inet_addr("10.18.2.10")}, [17987095445330984976]) = 491 --- SIGCHLD (Child exited) @ 0 (0) --- accept(9, {sa_family=AF_INET, sin_port=htons(50679), sin_addr=inet_addr("24.166.18.207")}, [17853676306370134032]) = 491 accept(9, 0x7fffc3d5ca30, [16]) = -1 EAGAIN (Resource temporarily unavailable) accept(9, {sa_family=AF_INET, sin_port=htons(52873), sin_addr=inet_addr("67.87.61.193")}, [9929874236988456976]) = 493 accept(9, {sa_family=AF_INET, sin_port=htons(54305), sin_addr=inet_addr("200.114.45.212")}, [2437573306904215568]) = 491 accept(9, 0x7fffc3d5ca30, [16]) = -1 EAGAIN (Resource temporarily unavailable) accept(9, {sa_family=AF_INET, sin_port=htons(2246), sin_addr=inet_addr("76.11.66.253")}, [14269655427913351184]) = 491 accept(9, {sa_family=AF_INET, sin_port=htons(32824), sin_addr=inet_addr("75.62.105.81")}, [4071254071732862992]) = 491 accept(9, 0x7fffc3d5ca30, [16]) = -1 EAGAIN (Resource temporarily unavailable) accept(9, {sa_family=AF_INET, sin_port=htons(2262), sin_addr=inet_addr("76.11.66.253")}, [15422576932520198160]) = 491 accept(9, {sa_family=AF_INET, sin_port=htons(3316), sin_addr=inet_addr("24.127.249.90")}, [17585430653564878864]) = 491 accept(9, 0x7fffc3d5ca30, [16]) = -1 EAGAIN (Resource temporarily unavailable) accept(9, {sa_family=AF_INET, sin_port=htons(4466), sin_addr=inet_addr("72.241.229.233")}, [8219350803517800464]) = 491 accept(9, {sa_family=AF_INET, sin_port=htons(1651), sin_addr=inet_addr("67.82.141.97")}, [8288312172811911184]) = 491 accept(9, 0x7fffc3d5ca30, [16]) = -1 EAGAIN (Resource temporarily unavailable) accept(9, {sa_family=AF_INET, sin_port=htons(2669), sin_addr=inet_addr("24.94.231.135")}, [7857092508491186192]) = 398 accept(9, {sa_family=AF_INET, sin_port=htons(50252), sin_addr=inet_addr("97.119.52.164")}, [5531546250907746320]) = 398 accept(9, 0x7fffc3d5ca30, [16]) = -1 EAGAIN (Resource temporarily unavailable) accept(9, {sa_family=AF_INET, sin_port=htons(2675), sin_addr=inet_addr("70.4.222.71")}, [8289438072718753808]) = 398 accept(9, {sa_family=AF_INET, sin_port=htons(63671), sin_addr=inet_addr("69.22.255.174")}, [13256345511754989584]) = 491 accept(9, 0x7fffc3d5ca30, [16]) = -1 EAGAIN (Resource temporarily unavailable) accept(9, {sa_family=AF_INET, sin_port=htons(54520), sin_addr=inet_addr("92.130.148.171")}, [17929956025058721808]) = 491 accept(13, {sa_family=AF_INET, sin_port=htons(41056), sin_addr=inet_addr("10.18.2.10")}, [6962565032504721424]) = 491 accept(13, 0x7fffc3d5ca30, [16]) = -1 EAGAIN (Resource temporarily unavailable) accept(9, {sa_family=AF_INET, sin_port=htons(49296), sin_addr=inet_addr("66.214.238.42")}, [10430336745580003344]) = 161 accept(9, 0x7fffc3d5ca30, [16]) = -1 EAGAIN (Resource temporarily unavailable) --- SIGCHLD (Child exited) @ 0 (0) --- accept(13, {sa_family=AF_INET, sin_port=htons(41199), sin_addr=inet_addr("10.18.2.10")}, [17266800979928416272]) = 161 accept(9, {sa_family=AF_INET, sin_port=htons(49376), sin_addr=inet_addr("68.34.251.86")}, [16194944268614238224]) = 161 --- SIGCHLD (Child exited) @ 0 (0) --- accept(13, {sa_family=AF_INET, sin_port=htons(41497), sin_addr=inet_addr("10.18.2.10")}, [1847038805765259280]) = 161 accept(13, 0x7fffc3d5ca30, [16]) = -1 EAGAIN (Resource temporarily unavailable) --- SIGCHLD (Child exited) @ 0 (0) --- accept(9, {sa_family=AF_INET, sin_port=htons(50013), sin_addr=inet_addr("71.241.76.86")}, [6756243874575810576]) = 161 accept(9, 0x7fffc3d5ca30, [16]) = -1 EAGAIN (Resource temporarily unavailable) --- SIGCHLD (Child exited) @ 0 (0) --- --- SIGCHLD (Child exited) @ 0 (0) --- accept(9, {sa_family=AF_INET, sin_port=htons(2594), sin_addr=inet_addr("24.27.123.223")}, [2452772955646590992]) = 161 accept(9, {sa_family=AF_INET, sin_port=htons(1690), sin_addr=inet_addr("67.242.75.46")}, [11098558340291100688]) = 276 accept(9, 0x7fffc3d5ca30, [16]) = -1 EAGAIN (Resource temporarily unavailable) accept(13, {sa_family=AF_INET, sin_port=htons(41825), sin_addr=inet_addr("10.18.2.10")}, [7035467051472781328]) = 276 accept(13, {sa_family=AF_INET, sin_port=htons(41910), sin_addr=inet_addr("10.18.2.10")}, [13160362544696655888]) = 276
here's the strace in the normal state:
accept(9, {sa_family=AF_INET, sin_port=htons(40442), sin_addr=inet_addr("74.140.139.210")}, [18058590089415491600]) = 2218 accept(9, 0x7fffc3d5ca30, [16]) = -1 EAGAIN (Resource temporarily unavailable) accept(9, {sa_family=AF_INET, sin_port=htons(58477), sin_addr=inet_addr("99.194.115.132")}, [7918454053414109200]) = 431 accept(13, {sa_family=AF_INET, sin_port=htons(50878), sin_addr=inet_addr("10.18.2.10")}, [13746674921184952336]) = 356 accept(9, {sa_family=AF_INET, sin_port=htons(1154), sin_addr=inet_addr("190.37.68.45")}, [9368613133427408912]) = 356 accept(9, 0x7fffc3d5ca30, [16]) = -1 EAGAIN (Resource temporarily unavailable) --- SIGCHLD (Child exited) @ 0 (0) --- accept(9, {sa_family=AF_INET, sin_port=htons(1212), sin_addr=inet_addr("125.60.240.211")}, [13547953587627229200]) = 356 accept(9, {sa_family=AF_INET, sin_port=htons(1158), sin_addr=inet_addr("190.37.68.45")}, [9656843509579120656]) = 2223 accept(9, {sa_family=AF_INET, sin_port=htons(52008), sin_addr=inet_addr("24.189.205.204")}, [2939443190379315216]) = 2224 accept(9, {sa_family=AF_INET, sin_port=htons(50808), sin_addr=inet_addr("72.155.68.155")}, [8702643338529996816]) = 2253 accept(9, {sa_family=AF_INET, sin_port=htons(2545), sin_addr=inet_addr("24.10.61.223")}, [17368413446520963088]) = 2263 accept(9, 0x7fffc3d5ca30, [16]) = -1 EAGAIN (Resource temporarily unavailable) accept(9, {sa_family=AF_INET, sin_port=htons(2547), sin_addr=inet_addr("24.10.61.223")}, [17512528634596818960]) = 356 accept(9, 0x7fffc3d5ca30, [16]) = -1 EAGAIN (Resource temporarily unavailable) accept(9, {sa_family=AF_INET, sin_port=htons(52012), sin_addr=inet_addr("24.189.205.204")}, [3227673566531026960]) = 180 accept(9, {sa_family=AF_INET, sin_port=htons(50809), sin_addr=inet_addr("72.155.68.155")}, [8774700932567924752]) = 2224 accept(9, 0x7fffc3d5ca30, [16]) = -1 EAGAIN (Resource temporarily unavailable) accept(9, {sa_family=AF_INET, sin_port=htons(1223), sin_addr=inet_addr("125.60.240.211")}, [14340587122044436496]) = 2253 accept(9, {sa_family=AF_INET, sin_port=htons(2462), sin_addr=inet_addr("206.74.170.35")}, [11387633141372944400]) = 2263 accept(9, 0x7fffc3d5ca30, [16]) = -1 EAGAIN (Resource temporarily unavailable) accept(9, {sa_family=AF_INET, sin_port=htons(2464), sin_addr=inet_addr("206.74.170.35")}, [11531748329448800272]) = 2263 accept(9, {sa_family=AF_INET, sin_port=htons(1645), sin_addr=inet_addr("61.145.215.253")}, [7855966608584343568]) = 2224 accept(9, 0x7fffc3d5ca30, [16]) = -1 EAGAIN (Resource temporarily unavailable) accept(9, {sa_family=AF_INET, sin_port=htons(4172), sin_addr=inet_addr("68.188.149.222")}, [5480880755099828240]) = 1283 accept(9, {sa_family=AF_INET, sin_port=htons(50931), sin_addr=inet_addr("70.129.57.218")}, [17565727405195132944]) = 1283 accept(9, 0x7fffc3d5ca30, [16]) = -1 EAGAIN (Resource temporarily unavailable) accept(9, {sa_family=AF_INET, sin_port=htons(4173), sin_addr=inet_addr("68.188.149.222")}, [5552938349137756176]) = 2224 accept(9, {sa_family=AF_INET, sin_port=htons(1122), sin_addr=inet_addr("76.127.204.185")}, [7062770124213714960]) = 1283 accept(9, 0x7fffc3d5ca30, [16]) = -1 EAGAIN (Resource temporarily unavailable) accept(13, {sa_family=AF_INET, sin_port=htons(51062), sin_addr=inet_addr("10.18.2.10")}, [8558809625430851600]) = 2292 accept(13, 0x7fffc3d5ca30, [16]) = -1 EAGAIN (Resource temporarily unavailable) accept(9, {sa_family=AF_INET, sin_port=htons(1646), sin_addr=inet_addr("61.145.215.253")}, [7928024202622271504]) = 2292 accept(9, {sa_family=AF_INET, sin_port=htons(50932), sin_addr=inet_addr("70.129.57.218")}, [17637784999233060880]) = 2310 accept(9, 0x7fffc3d5ca30, [16]) = -1 EAGAIN (Resource temporarily unavailable) accept(9, {sa_family=AF_INET, sin_port=htons(4001), sin_addr=inet_addr("99.13.103.221")}, [11605494773346992144]) = 507 accept(9, {sa_family=AF_INET, sin_port=htons(4004), sin_addr=inet_addr("99.13.103.221")}, [11821667555460775952]) = 507 accept(9, 0x7fffc3d5ca30, [16]) = -1 EAGAIN (Resource temporarily unavailable) accept(9, {sa_family=AF_INET, sin_port=htons(1131), sin_addr=inet_addr("76.127.204.185")}, [7711288470555066384]) = 1283 accept(9, {sa_family=AF_INET, sin_port=htons(2143), sin_addr=inet_addr("202.156.50.129")}, [6847723242006773776]) = 1894 accept(9, 0x7fffc3d5ca30, [16]) = -1 EAGAIN (Resource temporarily unavailable) accept(9, {sa_family=AF_INET, sin_port=htons(2146), sin_addr=inet_addr("202.156.50.129")}, [7063896024120557584]) = 1894 accept(9, {sa_family=AF_INET, sin_port=htons(1779), sin_addr=inet_addr("222.165.66.136")}, [17511684209666686992]) = 1278 accept(9, 0x7fffc3d5ca30, [16]) = -1 EAGAIN (Resource temporarily unavailable) accept(9, {sa_family=AF_INET, sin_port=htons(61674), sin_addr=inet_addr("99.230.169.52")}, [16929031007875629072]) = 1278 accept(9, {sa_family=AF_INET, sin_port=htons(61679), sin_addr=inet_addr("99.230.169.52")}, [17289318978065268752]) = 1278 accept(9, 0x7fffc3d5ca30, [16]) = -1 EAGAIN (Resource temporarily unavailable) accept(9, {sa_family=AF_INET, sin_port=htons(1780), sin_addr=inet_addr("222.165.66.136")}, [17583741803704614928]) = 1913 accept(9, {sa_family=AF_INET, sin_port=htons(52495), sin_addr=inet_addr("190.40.52.74")}, [1138566289384538128]) = 2310 accept(9, 0x7fffc3d5ca30, [16]) = -1 EAGAIN (Resource temporarily unavailable) accept(13, {sa_family=AF_INET, sin_port=htons(51181), sin_addr=inet_addr("10.18.2.10")}, [17133663315944275984]) = 2310 accept(9, {sa_family=AF_INET, sin_port=htons(60995), sin_addr=inet_addr("88.105.44.33")}, [4894849853588242448]) = 2310 accept(9, {sa_family=AF_INET, sin_port=htons(3636), sin_addr=inet_addr("71.64.15.79")}, [3750935548236136464]) = 2310 accept(9, 0x7fffc3d5ca30, [16]) = -1 EAGAIN (Resource temporarily unavailable) accept(13, {sa_family=AF_INET, sin_port=htons(51273), sin_addr=inet_addr("10.18.2.10")}, [5316499368700805136]) = 2334 accept(13, 0x7fffc3d5ca30, [16]) = -1 EAGAIN (Resource temporarily unavailable) accept(9, {sa_family=AF_INET, sin_port=htons(52498), sin_addr=inet_addr("190.40.52.74")}, [1354739071498321936]) = 2334 accept(9, {sa_family=AF_INET, sin_port=htons(1440), sin_addr=inet_addr("24.21.117.196")}, [11530622429541957648]) = 2334 accept(9, 0x7fffc3d5ca30, [16]) = -1 EAGAIN (Resource temporarily unavailable) accept(9, {sa_family=AF_INET, sin_port=htons(1442), sin_addr=inet_addr("24.21.117.196")}, [11674737617617813520]) = 2334 accept(9, {sa_family=AF_INET, sin_port=htons(60996), sin_addr=inet_addr("88.105.44.33")}, [4966907447626170384]) = 2336 accept(9, 0x7fffc3d5ca30, [16]) = -1 EAGAIN (Resource temporarily unavailable) accept(13, {sa_family=AF_INET, sin_port=htons(51345), sin_addr=inet_addr("10.18.2.10")}, [10504646139431616528]) = 2337 accept(13, {sa_family=AF_INET, sin_port=htons(51348), sin_addr=inet_addr("10.18.2.10")}, [10720818921545400336]) = 2337
The only noticeable difference is the frequency of --- SIGCHLD (Child exited) @ 0 (0) --- but it occurred in the normal state as well. The server sends emails via child subprocesses, so I suspect Child exited may be due to that.
If that's the case, the strace output is identical, and I do not understand what's causing the problem at all.
-----Original Message----- From: twisted-python-bounces@twistedmatrix.com [mailto:twisted-python- bounces@twistedmatrix.com] On Behalf Of Jean-Paul Calderone Sent: Wednesday, September 10, 2008 2:47 PM To: Twisted general discussion Subject: RE: [Twisted-Python] intermittent problem: not accepting new connections
On Wed, 10 Sep 2008 14:01:27 -0700, Alec Matusis <matusis@yahoo.com> wrote:
Do non-error log messages continue to appear in the Twisted log? ie, is it clear that the logging system is still working, or could it have failed in some way, obscuring any exception reports? Yes, I print the results of garbage collection every 10 min like this:
print "number of objects tracked by the garbage collector is:", len(gc.get_objects()) and they appear in the main twisted log.
2008/09/10 13:21 -0700 [-] number of objects tracked by the garbage
collector is: 860021 > 2008/09/10 13:31 -0700 [-] number of objects tracked by the garbage collector is: 864316 >> Any new unhandled errno values should definitely result in an exception >> being logged (notice that the `raise´ which follows the checks for >> various errno values is inside a try/except which logs any exception). > I noticed that raise too... Could it then be EWOULDBLOCK, EAGAIN or EPERM? > except socket.error, e: > if e.args[0] in (EWOULDBLOCK, EAGAIN): > self.numberAccepts = i > break > elif e.args[0] == EPERM: > # Netfilter on Linux may have rejected the > # connection, but we get told to try to accept() > # anyway. > continue > > > I am not sure how to debug this problem- I have another twisted server of a different type on that machine, and while the problematic server stops accepting connections, the second one works just fine, so this is not a machine-wide issue. > What could it be? You could try running strace against the server while it's not accepting connections. You can limit the output to just accept calls, too:
strace -p <pid> -etrace=accept
Maybe this will show you that accept calls are happening but failing in some way. If it produces no output, you can widen it and see if you can tell what the server is doing if it isn't calling accept().
Jean-Paul
_______________________________________________ Twisted-Python mailing list Twisted-Python@twistedmatrix.com http://twistedmatrix.com/cgi-bin/mailman/listinfo/twisted-python
_______________________________________________ Twisted-Python mailing list Twisted-Python@twistedmatrix.com http://twistedmatrix.com/cgi-bin/mailman/listinfo/twisted-python
Not necessarily related to what you've described, but I'll share something that's helped a good deal on my most-heavily hit twisted servers. Presuming you're using Linux: echo 1 > /proc/sys/net/ipv4/tcp_tw_recycle from http://lartc.org/howto/lartc.kernel.obscure.html : "Enable fast recycling TIME-WAIT sockets. Default value is 1. It should not be changed without advice/request of technical experts" My expert advice: only use this on machines connected on a low-latency LAN. It *will* break internet-facing interfaces. It halves the constant used by the Nagle algorithm: http://en.wikipedia.org/wiki/Nagle's_algorithm Kind regards, /p
On Thu, 11 Sep 2008 14:31:38 +0100, "Paul C. Nendick" <paul.nendick@gmail.com> wrote:
Not necessarily related to what you've described, but I'll share something that's helped a good deal on my most-heavily hit twisted servers. Presuming you're using Linux:
echo 1 > /proc/sys/net/ipv4/tcp_tw_recycle
from http://lartc.org/howto/lartc.kernel.obscure.html :
"Enable fast recycling TIME-WAIT sockets. Default value is 1. It should not be changed without advice/request of technical experts"
My expert advice: only use this on machines connected on a low-latency LAN. It *will* break internet-facing interfaces. It halves the constant used by the Nagle algorithm:
This is somewhat interesting. It suggests a potential problem which I hadn't thought about before. If you need to accept more than about 64k connections (not necessarily concurrent) in less than TIME-WAIT seconds, you might run out of ports. Anyone know what happens to new connection attempts to a server in this condition? Alec, any idea if your server could be getting into this state every once in a while? This is an appealing hypothesis, since it wouldn't necessarily happen at peak connection time (but potentially shortly after a peak), would resolve itself given a short period of time, wouldn't necessarily prevent all new connection attempts, since old TIME-WAIT sockets would be gradually timing out (so your other low- volume servers might still appear to be working normally), wouldn't interfere with already established connections, and might not change the userspace-visible syscall behavior (depending on what Linux does in this case, but I wouldn't be surprised if connection failures due to this never showed up in an accept(2) result). Jean-Paul
Jean-Paul Calderone wrote:
On Thu, 11 Sep 2008 14:31:38 +0100, "Paul C. Nendick" <paul.nendick@gmail.com> wrote:
Not necessarily related to what you've described, but I'll share something that's helped a good deal on my most-heavily hit twisted servers. Presuming you're using Linux:
echo 1 > /proc/sys/net/ipv4/tcp_tw_recycle
from http://lartc.org/howto/lartc.kernel.obscure.html :
"Enable fast recycling TIME-WAIT sockets. Default value is 1. It should not be changed without advice/request of technical experts"
My expert advice: only use this on machines connected on a low-latency LAN. It *will* break internet-facing interfaces. It halves the constant used by the Nagle algorithm:
This is somewhat interesting. It suggests a potential problem which I hadn't thought about before. If you need to accept more than about 64k connections (not necessarily concurrent) in less than TIME-WAIT seconds, you might run out of ports. Anyone know what happens to new connection attempts to a server in this condition?
Why would you run out of ports? accept()ed sockets use the same port as the listen()ing socket. Shortening TIME_WAIT just saves on kernel memory.
On Thu, 11 Sep 2008 15:38:06 +0100, Phil Mayers <p.mayers@imperial.ac.uk> wrote:
Jean-Paul Calderone wrote:
On Thu, 11 Sep 2008 14:31:38 +0100, "Paul C. Nendick" <paul.nendick@gmail.com> wrote:
Not necessarily related to what you've described, but I'll share something that's helped a good deal on my most-heavily hit twisted servers. Presuming you're using Linux:
echo 1 > /proc/sys/net/ipv4/tcp_tw_recycle
from http://lartc.org/howto/lartc.kernel.obscure.html :
"Enable fast recycling TIME-WAIT sockets. Default value is 1. It should not be changed without advice/request of technical experts"
My expert advice: only use this on machines connected on a low-latency LAN. It *will* break internet-facing interfaces. It halves the constant used by the Nagle algorithm:
This is somewhat interesting. It suggests a potential problem which I hadn't thought about before. If you need to accept more than about 64k connections (not necessarily concurrent) in less than TIME-WAIT seconds, you might run out of ports. Anyone know what happens to new connection attempts to a server in this condition?
Why would you run out of ports? accept()ed sockets use the same port as the listen()ing socket.
Uh yes, I guess that would be a fatal flaw in this hypothesis. Woops. :) Jean-Paul
How does ulimit -a compare between old and new machine?
ulimit is only higher on the new machine, I upped it. Also, from my experience, running into ulimits produces log messages. alecm@serv2 ~> cat /proc/23678/limits Limit Soft Limit Hard Limit Units Max cpu time unlimited unlimited ms Max file size unlimited unlimited bytes Max data size unlimited unlimited bytes Max stack size 8388608 unlimited bytes Max core file size 0 unlimited bytes Max resident set 1073741824 1073741824 bytes Max processes 126976 126976 processes Max open files 25000 25000 files Max locked memory 32768 32768 bytes Max address space unlimited unlimited bytes Max file locks unlimited unlimited locks Max pending signals 126976 126976 signals Max msgqueue size 819200 819200 bytes Max nice priority 0 0 Max realtime priority 0 0
If you need to accept more than about 64k connections (not necessarily concurrent) in less than TIME-WAIT seconds, you might run out of ports.
How does netstat look like?
I have to wait for another outage tonight, to get more definite results on this. So far, I have had only one very brief outage at 7:46am I was asleep, but I have some new data logged. I have two servers, one on port 5222 (problematic one, let's call it serv1 ), and one on 5228 (that has no problems, call it serv2 ). I made a script that measures the number of connections on each type of server respectively like this: while [ 1 ] do sleep 5 netstat -ant | grep ESTABLISHED | awk '/:5228 /{a += 1}/:5222 /{b += 1} END {print strftime(" %d %b %Y %H:%M:%S", systime( ))" serv2: "a" serv1: "b" total: "a+b }' done This is what happened around 7:46am (lines that have "+" in front indicate anomaly): 11 Sep 2008 07:41:29 serv2: 2756 serv1: 1064 total: 3820 11 Sep 2008 07:41:36 serv2: 2764 serv1: 1062 total: 3826 11 Sep 2008 07:41:44 serv2: 2784 serv1: 1049 total: 3833 11 Sep 2008 07:41:51 serv2: 2756 serv1: 1055 total: 3811 11 Sep 2008 07:41:59 serv2: 2760 serv1: 1066 total: 3826 11 Sep 2008 07:42:06 serv2: 2777 serv1: 1050 total: 3827 11 Sep 2008 07:42:14 serv2: 2769 serv1: 1054 total: 3823 11 Sep 2008 07:42:21 serv2: 2751 serv1: 1055 total: 3806 11 Sep 2008 07:42:29 serv2: 2760 serv1: 1050 total: 3810 11 Sep 2008 07:42:36 serv2: 2747 serv1: 1040 total: 3787 11 Sep 2008 07:42:44 serv2: 2737 serv1: 1046 total: 3783 11 Sep 2008 07:42:51 serv2: 2739 serv1: 1046 total: 3785 11 Sep 2008 07:42:59 serv2: 2743 serv1: 1037 total: 3780 11 Sep 2008 07:43:07 serv2: 2720 serv1: 1041 total: 3761 11 Sep 2008 07:43:14 serv2: 2714 serv1: 1047 total: 3761 11 Sep 2008 07:43:22 serv2: 2721 serv1: 1045 total: 3766 11 Sep 2008 07:43:29 serv2: 2697 serv1: 1056 total: 3753 11 Sep 2008 07:43:37 serv2: 2710 serv1: 1059 total: 3769 +11 Sep 2008 07:43:44 serv2: 2765 serv1: 1304 total: 4069 +11 Sep 2008 07:43:53 serv2: 2854 serv1: 1904 total: 4758 +11 Sep 2008 07:44:01 serv2: 2714 serv1: 2190 total: 4904 +11 Sep 2008 07:44:09 serv2: 2715 serv1: 2202 total: 4917 +11 Sep 2008 07:44:17 serv2: 2779 serv1: 1891 total: 4670 +11 Sep 2008 07:44:26 serv2: 2812 serv1: 2284 total: 5096 +11 Sep 2008 07:44:36 serv2: 2828 serv1: 3496 total: 6324 +11 Sep 2008 07:44:46 serv2: 2715 serv1: 4327 total: 7042 +11 Sep 2008 07:44:56 serv2: 2638 serv1: 3499 total: 6137 +11 Sep 2008 07:45:05 serv2: 2714 serv1: 2396 total: 5110 +11 Sep 2008 07:45:15 serv2: 2776 serv1: 1464 total: 4240 +11 Sep 2008 07:45:25 serv2: 2728 serv1: 1604 total: 4332 +11 Sep 2008 07:45:35 serv2: 2708 serv1: 1566 total: 4274 +11 Sep 2008 07:45:45 serv2: 2750 serv1: 1680 total: 4430 +11 Sep 2008 07:45:54 serv2: 2755 serv1: 1311 total: 4066 +11 Sep 2008 07:46:04 serv2: 2704 serv1: 1178 total: 3882 11 Sep 2008 07:46:13 serv2: 2644 serv1: 1024 total: 3668 11 Sep 2008 07:46:22 serv2: 2573 serv1: 981 total: 3554 11 Sep 2008 07:46:30 serv2: 2739 serv1: 1051 total: 3790 11 Sep 2008 07:46:39 serv2: 2773 serv1: 1043 total: 3816 11 Sep 2008 07:46:46 serv2: 2548 serv1: 959 total: 3507 11 Sep 2008 07:46:54 serv2: 2773 serv1: 1044 total: 3817 11 Sep 2008 07:47:02 serv2: 2745 serv1: 1027 total: 3772 11 Sep 2008 07:47:10 serv2: 2591 serv1: 960 total: 3551 11 Sep 2008 07:47:17 serv2: 2721 serv1: 1002 total: 3723 11 Sep 2008 07:47:25 serv2: 2753 serv1: 1030 total: 3783 11 Sep 2008 07:47:32 serv2: 2773 serv1: 1029 total: 3802 11 Sep 2008 07:47:40 serv2: 2768 serv1: 1025 total: 3793 11 Sep 2008 07:47:47 serv2: 2764 serv1: 1021 total: 3785 11 Sep 2008 07:47:55 serv2: 2750 serv1: 1021 total: 3771 Each server also reports the maximum numbers of authenticated connected clients to the database. Even though serv1 peaked at 07:44:46 at 4327 "ESTABLISHED" connections, the number of normally authenticated clients in the database for this time period is only 1100, which does not reflect this weird connections spike. Since I was asleep at 7:46am, I did not have a chance to investigate what those connections actually were.
Anyone know what happens to new connection attempts to a server in this condition?
One thing that I noticed yesterday, that when server was in this condition, telnet simply hang, as if no ACK packet was received back: alecm@serv2 ~> telnet localhost 5222 Trying 127.0.0.1... and nothing This behavior or telnet seems to be unique to the case when ACK is not received in the handshake, I verified that with tcpdump. To contrast that, if nothing listens on the destination port for example, telnet gives: alecm@serv2 ~> telnet localhost 50007 Trying 127.0.0.1... telnet: connect to address 127.0.0.1: Connection refused telnet: Unable to connect to remote host: Connection refused which corresponds to RST packet that kernel returns in that case. I tried to write a 5 line mucked up server with accept() not returning any packets, and failed.
-----Original Message----- From: twisted-python-bounces@twistedmatrix.com [mailto:twisted-python- bounces@twistedmatrix.com] On Behalf Of Jean-Paul Calderone Sent: Thursday, September 11, 2008 6:47 AM To: Twisted general discussion Subject: Re: [Twisted-Python] intermittent problem: not accepting new connections
On Thu, 11 Sep 2008 14:31:38 +0100, "Paul C. Nendick" <paul.nendick@gmail.com> wrote:
Not necessarily related to what you've described, but I'll share something that's helped a good deal on my most-heavily hit twisted servers. Presuming you're using Linux:
echo 1 > /proc/sys/net/ipv4/tcp_tw_recycle
from http://lartc.org/howto/lartc.kernel.obscure.html :
"Enable fast recycling TIME-WAIT sockets. Default value is 1. It should not be changed without advice/request of technical experts"
My expert advice: only use this on machines connected on a low-latency LAN. It *will* break internet-facing interfaces. It halves the constant used by the Nagle algorithm:
This is somewhat interesting. It suggests a potential problem which I hadn't thought about before. If you need to accept more than about 64k connections (not necessarily concurrent) in less than TIME-WAIT seconds, you might run out of ports. Anyone know what happens to new connection attempts to a server in this condition?
Alec, any idea if your server could be getting into this state every once in a while? This is an appealing hypothesis, since it wouldn't necessarily happen at peak connection time (but potentially shortly after a peak), would resolve itself given a short period of time, wouldn't necessarily prevent all new connection attempts, since old TIME-WAIT sockets would be gradually timing out (so your other low- volume servers might still appear to be working normally), wouldn't interfere with already established connections, and might not change the userspace-visible syscall behavior (depending on what Linux does in this case, but I wouldn't be surprised if connection failures due to this never showed up in an accept(2) result).
Jean-Paul
_______________________________________________ Twisted-Python mailing list Twisted-Python@twistedmatrix.com http://twistedmatrix.com/cgi-bin/mailman/listinfo/twisted-python
We had a similar problem like this, might as well share it with you guys: Our Jabber server stopped taking in new connections due to our iptables at some point in time (even lost some packets). We have an ip routing scheme where ports 25,80,5223,10873 routed to 5222. Whenever this happens, we get error error from our syslogs: ""ip_conntrack: table full, dropping packet" http://support.imagestream.com/Resolving_ip_conntrack_table_full_Errors.html Might as well check dmesg if you have something like this. What we did is disable iptables and removed its kernel modules. On Fri, Sep 12, 2008 at 3:57 AM, Alec Matusis <matusis@yahoo.com> wrote:
How does ulimit -a compare between old and new machine?
ulimit is only higher on the new machine, I upped it. Also, from my experience, running into ulimits produces log messages.
alecm@serv2 ~> cat /proc/23678/limits
Limit Soft Limit Hard Limit Units
Max cpu time unlimited unlimited ms
Max file size unlimited unlimited bytes
Max data size unlimited unlimited bytes
Max stack size 8388608 unlimited bytes
Max core file size 0 unlimited bytes
Max resident set 1073741824 1073741824 bytes
Max processes 126976 126976 processes Max open files 25000 25000 files
Max locked memory 32768 32768 bytes
Max address space unlimited unlimited bytes
Max file locks unlimited unlimited locks
Max pending signals 126976 126976 signals
Max msgqueue size 819200 819200 bytes
Max nice priority 0 0 Max realtime priority 0 0
If you need to accept more than about 64k connections (not necessarily concurrent) in less than TIME-WAIT seconds, you might run out of ports.
How does netstat look like?
I have to wait for another outage tonight, to get more definite results on this. So far, I have had only one very brief outage at 7:46am I was asleep, but I have some new data logged. I have two servers, one on port 5222 (problematic one, let's call it serv1 ), and one on 5228 (that has no problems, call it serv2 ). I made a script that measures the number of connections on each type of server respectively like this:
while [ 1 ] do sleep 5 netstat -ant | grep ESTABLISHED | awk '/:5228 /{a += 1}/:5222 /{b += 1} END {print strftime(" %d %b %Y %H:%M:%S", systime( ))" serv2: "a" serv1: "b" total: "a+b }' done
This is what happened around 7:46am (lines that have "+" in front indicate anomaly):
11 Sep 2008 07:41:29 serv2: 2756 serv1: 1064 total: 3820 11 Sep 2008 07:41:36 serv2: 2764 serv1: 1062 total: 3826 11 Sep 2008 07:41:44 serv2: 2784 serv1: 1049 total: 3833 11 Sep 2008 07:41:51 serv2: 2756 serv1: 1055 total: 3811 11 Sep 2008 07:41:59 serv2: 2760 serv1: 1066 total: 3826 11 Sep 2008 07:42:06 serv2: 2777 serv1: 1050 total: 3827 11 Sep 2008 07:42:14 serv2: 2769 serv1: 1054 total: 3823 11 Sep 2008 07:42:21 serv2: 2751 serv1: 1055 total: 3806 11 Sep 2008 07:42:29 serv2: 2760 serv1: 1050 total: 3810 11 Sep 2008 07:42:36 serv2: 2747 serv1: 1040 total: 3787 11 Sep 2008 07:42:44 serv2: 2737 serv1: 1046 total: 3783 11 Sep 2008 07:42:51 serv2: 2739 serv1: 1046 total: 3785 11 Sep 2008 07:42:59 serv2: 2743 serv1: 1037 total: 3780 11 Sep 2008 07:43:07 serv2: 2720 serv1: 1041 total: 3761 11 Sep 2008 07:43:14 serv2: 2714 serv1: 1047 total: 3761 11 Sep 2008 07:43:22 serv2: 2721 serv1: 1045 total: 3766 11 Sep 2008 07:43:29 serv2: 2697 serv1: 1056 total: 3753 11 Sep 2008 07:43:37 serv2: 2710 serv1: 1059 total: 3769 +11 Sep 2008 07:43:44 serv2: 2765 serv1: 1304 total: 4069 +11 Sep 2008 07:43:53 serv2: 2854 serv1: 1904 total: 4758 +11 Sep 2008 07:44:01 serv2: 2714 serv1: 2190 total: 4904 +11 Sep 2008 07:44:09 serv2: 2715 serv1: 2202 total: 4917 +11 Sep 2008 07:44:17 serv2: 2779 serv1: 1891 total: 4670 +11 Sep 2008 07:44:26 serv2: 2812 serv1: 2284 total: 5096 +11 Sep 2008 07:44:36 serv2: 2828 serv1: 3496 total: 6324 +11 Sep 2008 07:44:46 serv2: 2715 serv1: 4327 total: 7042 +11 Sep 2008 07:44:56 serv2: 2638 serv1: 3499 total: 6137 +11 Sep 2008 07:45:05 serv2: 2714 serv1: 2396 total: 5110 +11 Sep 2008 07:45:15 serv2: 2776 serv1: 1464 total: 4240 +11 Sep 2008 07:45:25 serv2: 2728 serv1: 1604 total: 4332 +11 Sep 2008 07:45:35 serv2: 2708 serv1: 1566 total: 4274 +11 Sep 2008 07:45:45 serv2: 2750 serv1: 1680 total: 4430 +11 Sep 2008 07:45:54 serv2: 2755 serv1: 1311 total: 4066 +11 Sep 2008 07:46:04 serv2: 2704 serv1: 1178 total: 3882 11 Sep 2008 07:46:13 serv2: 2644 serv1: 1024 total: 3668 11 Sep 2008 07:46:22 serv2: 2573 serv1: 981 total: 3554 11 Sep 2008 07:46:30 serv2: 2739 serv1: 1051 total: 3790 11 Sep 2008 07:46:39 serv2: 2773 serv1: 1043 total: 3816 11 Sep 2008 07:46:46 serv2: 2548 serv1: 959 total: 3507 11 Sep 2008 07:46:54 serv2: 2773 serv1: 1044 total: 3817 11 Sep 2008 07:47:02 serv2: 2745 serv1: 1027 total: 3772 11 Sep 2008 07:47:10 serv2: 2591 serv1: 960 total: 3551 11 Sep 2008 07:47:17 serv2: 2721 serv1: 1002 total: 3723 11 Sep 2008 07:47:25 serv2: 2753 serv1: 1030 total: 3783 11 Sep 2008 07:47:32 serv2: 2773 serv1: 1029 total: 3802 11 Sep 2008 07:47:40 serv2: 2768 serv1: 1025 total: 3793 11 Sep 2008 07:47:47 serv2: 2764 serv1: 1021 total: 3785 11 Sep 2008 07:47:55 serv2: 2750 serv1: 1021 total: 3771
Each server also reports the maximum numbers of authenticated connected clients to the database. Even though serv1 peaked at 07:44:46 at 4327 "ESTABLISHED" connections, the number of normally authenticated clients in the database for this time period is only 1100, which does not reflect this weird connections spike. Since I was asleep at 7:46am, I did not have a chance to investigate what those connections actually were.
Anyone know what happens to new connection attempts to a server in this condition?
One thing that I noticed yesterday, that when server was in this condition, telnet simply hang, as if no ACK packet was received back:
alecm@serv2 ~> telnet localhost 5222 Trying 127.0.0.1...
and nothing
This behavior or telnet seems to be unique to the case when ACK is not received in the handshake, I verified that with tcpdump. To contrast that, if nothing listens on the destination port for example, telnet gives:
alecm@serv2 ~> telnet localhost 50007 Trying 127.0.0.1... telnet: connect to address 127.0.0.1: Connection refused telnet: Unable to connect to remote host: Connection refused
which corresponds to RST packet that kernel returns in that case. I tried to write a 5 line mucked up server with accept() not returning any packets, and failed.
-----Original Message----- From: twisted-python-bounces@twistedmatrix.com [mailto:twisted-python- bounces@twistedmatrix.com] On Behalf Of Jean-Paul Calderone Sent: Thursday, September 11, 2008 6:47 AM To: Twisted general discussion Subject: Re: [Twisted-Python] intermittent problem: not accepting new connections
On Thu, 11 Sep 2008 14:31:38 +0100, "Paul C. Nendick" <paul.nendick@gmail.com> wrote:
Not necessarily related to what you've described, but I'll share something that's helped a good deal on my most-heavily hit twisted servers. Presuming you're using Linux:
echo 1 > /proc/sys/net/ipv4/tcp_tw_recycle
from http://lartc.org/howto/lartc.kernel.obscure.html :
"Enable fast recycling TIME-WAIT sockets. Default value is 1. It should not be changed without advice/request of technical experts"
My expert advice: only use this on machines connected on a low-latency LAN. It *will* break internet-facing interfaces. It halves the constant used by the Nagle algorithm:
http://en.wikipedia.org/wiki/Nagle's_algorithm<http://en.wikipedia.org/wiki/Nagle%27s_algorithm>
This is somewhat interesting. It suggests a potential problem which I hadn't thought about before. If you need to accept more than about 64k connections (not necessarily concurrent) in less than TIME-WAIT seconds, you might run out of ports. Anyone know what happens to new connection attempts to a server in this condition?
Alec, any idea if your server could be getting into this state every once in a while? This is an appealing hypothesis, since it wouldn't necessarily happen at peak connection time (but potentially shortly after a peak), would resolve itself given a short period of time, wouldn't necessarily prevent all new connection attempts, since old TIME-WAIT sockets would be gradually timing out (so your other low- volume servers might still appear to be working normally), wouldn't interfere with already established connections, and might not change the userspace-visible syscall behavior (depending on what Linux does in this case, but I wouldn't be surprised if connection failures due to this never showed up in an accept(2) result).
Jean-Paul
_______________________________________________ Twisted-Python mailing list Twisted-Python@twistedmatrix.com http://twistedmatrix.com/cgi-bin/mailman/listinfo/twisted-python
_______________________________________________ Twisted-Python mailing list Twisted-Python@twistedmatrix.com http://twistedmatrix.com/cgi-bin/mailman/listinfo/twisted-python
How does ulimit -a compare between old and new machine?
If you need to accept more than about 64k connections (not necessarily concurrent) in less than TIME-WAIT seconds, you might run out of ports. How does netstat look like? I have to wait for another outage tonight, to get more definite results on
I wanted to post that I solved the problem, and it's unrelated to Twisted. I noticed that this server had sudden increases in the numbers of ESTABLISHED connections in netstat, but not in the number of authenticated connections reported by twisted to the database. These increases correlated with the connection outages. Creating a shell script that monitors the connection rate, and invoking tcpdump and sorting/counting the source IPs when it increases, showed that this server was pounded twice a day for 3-5min from a small set of IPs with new connections at about 100 conns/sec, that sent some junk data (these connections timed out in twisted, because that junk never contained my line separator, and I am using LineOnlyReceiver protocol). This was hardly an attack against my server, because this new server was assigned an IP that I have never used before- I guess this is just an internet reality ;) Even though this does not look like SYN attacks (few connections were in SYN_RCVD state), for some reason enabling net.ipv4.tcp_syncookies=1 made everything normal. When these bizarre connections began to pile uo, I started seeing "possible SYN flooding on port 5222. Sending cookies." in /var/log/messages , and the connectivity outages stopped I am sorry for the confusion, I was emboldened by finding the epoll bug back in April, but this time Twisted continues to rock, and it's just a stupid Ubuntu's tcp setting. From: twisted-python-bounces@twistedmatrix.com [mailto:twisted-python-bounces@twistedmatrix.com] On Behalf Of Alvin Delagon Sent: Thursday, September 11, 2008 6:17 PM To: Twisted general discussion Subject: Re: [Twisted-Python] intermittent problem: not accepting new connections We had a similar problem like this, might as well share it with you guys: Our Jabber server stopped taking in new connections due to our iptables at some point in time (even lost some packets). We have an ip routing scheme where ports 25,80,5223,10873 routed to 5222. Whenever this happens, we get error error from our syslogs: ""ip_conntrack: table full, dropping packet" http://support.imagestream.com/Resolving_ip_conntrack_table_full_Errors.html Might as well check dmesg if you have something like this. What we did is disable iptables and removed its kernel modules. On Fri, Sep 12, 2008 at 3:57 AM, Alec Matusis <matusis@yahoo.com> wrote: ulimit is only higher on the new machine, I upped it. Also, from my experience, running into ulimits produces log messages. alecm@serv2 ~> cat /proc/23678/limits Limit Soft Limit Hard Limit Units Max cpu time unlimited unlimited ms Max file size unlimited unlimited bytes Max data size unlimited unlimited bytes Max stack size 8388608 unlimited bytes Max core file size 0 unlimited bytes Max resident set 1073741824 1073741824 bytes Max processes 126976 126976 processes Max open files 25000 25000 files Max locked memory 32768 32768 bytes Max address space unlimited unlimited bytes Max file locks unlimited unlimited locks Max pending signals 126976 126976 signals Max msgqueue size 819200 819200 bytes Max nice priority 0 0 Max realtime priority 0 0 this. So far, I have had only one very brief outage at 7:46am I was asleep, but I have some new data logged. I have two servers, one on port 5222 (problematic one, let's call it serv1 ), and one on 5228 (that has no problems, call it serv2 ). I made a script that measures the number of connections on each type of server respectively like this: while [ 1 ] do sleep 5 netstat -ant | grep ESTABLISHED | awk '/:5228 /{a += 1}/:5222 /{b += 1} END {print strftime(" %d %b %Y %H:%M:%S", systime( ))" serv2: "a" serv1: "b" total: "a+b }' done This is what happened around 7:46am (lines that have "+" in front indicate anomaly): 11 Sep 2008 07:41:29 serv2: 2756 serv1: 1064 total: 3820 11 Sep 2008 07:41:36 serv2: 2764 serv1: 1062 total: 3826 11 Sep 2008 07:41:44 serv2: 2784 serv1: 1049 total: 3833 11 Sep 2008 07:41:51 serv2: 2756 serv1: 1055 total: 3811 11 Sep 2008 07:41:59 serv2: 2760 serv1: 1066 total: 3826 11 Sep 2008 07:42:06 serv2: 2777 serv1: 1050 total: 3827 11 Sep 2008 07:42:14 serv2: 2769 serv1: 1054 total: 3823 11 Sep 2008 07:42:21 serv2: 2751 serv1: 1055 total: 3806 11 Sep 2008 07:42:29 serv2: 2760 serv1: 1050 total: 3810 11 Sep 2008 07:42:36 serv2: 2747 serv1: 1040 total: 3787 11 Sep 2008 07:42:44 serv2: 2737 serv1: 1046 total: 3783 11 Sep 2008 07:42:51 serv2: 2739 serv1: 1046 total: 3785 11 Sep 2008 07:42:59 serv2: 2743 serv1: 1037 total: 3780 11 Sep 2008 07:43:07 serv2: 2720 serv1: 1041 total: 3761 11 Sep 2008 07:43:14 serv2: 2714 serv1: 1047 total: 3761 11 Sep 2008 07:43:22 serv2: 2721 serv1: 1045 total: 3766 11 Sep 2008 07:43:29 serv2: 2697 serv1: 1056 total: 3753 11 Sep 2008 07:43:37 serv2: 2710 serv1: 1059 total: 3769 +11 Sep 2008 07:43:44 serv2: 2765 serv1: 1304 total: 4069 +11 Sep 2008 07:43:53 serv2: 2854 serv1: 1904 total: 4758 +11 Sep 2008 07:44:01 serv2: 2714 serv1: 2190 total: 4904 +11 Sep 2008 07:44:09 serv2: 2715 serv1: 2202 total: 4917 +11 Sep 2008 07:44:17 serv2: 2779 serv1: 1891 total: 4670 +11 Sep 2008 07:44:26 serv2: 2812 serv1: 2284 total: 5096 +11 Sep 2008 07:44:36 serv2: 2828 serv1: 3496 total: 6324 +11 Sep 2008 07:44:46 serv2: 2715 serv1: 4327 total: 7042 +11 Sep 2008 07:44:56 serv2: 2638 serv1: 3499 total: 6137 +11 Sep 2008 07:45:05 serv2: 2714 serv1: 2396 total: 5110 +11 Sep 2008 07:45:15 serv2: 2776 serv1: 1464 total: 4240 +11 Sep 2008 07:45:25 serv2: 2728 serv1: 1604 total: 4332 +11 Sep 2008 07:45:35 serv2: 2708 serv1: 1566 total: 4274 +11 Sep 2008 07:45:45 serv2: 2750 serv1: 1680 total: 4430 +11 Sep 2008 07:45:54 serv2: 2755 serv1: 1311 total: 4066 +11 Sep 2008 07:46:04 serv2: 2704 serv1: 1178 total: 3882 11 Sep 2008 07:46:13 serv2: 2644 serv1: 1024 total: 3668 11 Sep 2008 07:46:22 serv2: 2573 serv1: 981 total: 3554 11 Sep 2008 07:46:30 serv2: 2739 serv1: 1051 total: 3790 11 Sep 2008 07:46:39 serv2: 2773 serv1: 1043 total: 3816 11 Sep 2008 07:46:46 serv2: 2548 serv1: 959 total: 3507 11 Sep 2008 07:46:54 serv2: 2773 serv1: 1044 total: 3817 11 Sep 2008 07:47:02 serv2: 2745 serv1: 1027 total: 3772 11 Sep 2008 07:47:10 serv2: 2591 serv1: 960 total: 3551 11 Sep 2008 07:47:17 serv2: 2721 serv1: 1002 total: 3723 11 Sep 2008 07:47:25 serv2: 2753 serv1: 1030 total: 3783 11 Sep 2008 07:47:32 serv2: 2773 serv1: 1029 total: 3802 11 Sep 2008 07:47:40 serv2: 2768 serv1: 1025 total: 3793 11 Sep 2008 07:47:47 serv2: 2764 serv1: 1021 total: 3785 11 Sep 2008 07:47:55 serv2: 2750 serv1: 1021 total: 3771 Each server also reports the maximum numbers of authenticated connected clients to the database. Even though serv1 peaked at 07:44:46 at 4327 "ESTABLISHED" connections, the number of normally authenticated clients in the database for this time period is only 1100, which does not reflect this weird connections spike. Since I was asleep at 7:46am, I did not have a chance to investigate what those connections actually were.
Anyone know what happens to new connection attempts to a server in this condition? One thing that I noticed yesterday, that when server was in this condition, telnet simply hang, as if no ACK packet was received back:
alecm@serv2 ~> telnet localhost 5222 Trying 127.0.0.1... and nothing This behavior or telnet seems to be unique to the case when ACK is not received in the handshake, I verified that with tcpdump. To contrast that, if nothing listens on the destination port for example, telnet gives: alecm@serv2 ~> telnet localhost 50007 Trying 127.0.0.1... telnet: connect to address 127.0.0.1: Connection refused telnet: Unable to connect to remote host: Connection refused which corresponds to RST packet that kernel returns in that case. I tried to write a 5 line mucked up server with accept() not returning any packets, and failed.
-----Original Message----- From: twisted-python-bounces@twistedmatrix.com [mailto:twisted-python- bounces@twistedmatrix.com] On Behalf Of Jean-Paul Calderone Sent: Thursday, September 11, 2008 6:47 AM To: Twisted general discussion Subject: Re: [Twisted-Python] intermittent problem: not accepting new connections
On Thu, 11 Sep 2008 14:31:38 +0100, "Paul C. Nendick" <paul.nendick@gmail.com> wrote:
Not necessarily related to what you've described, but I'll share something that's helped a good deal on my most-heavily hit twisted servers. Presuming you're using Linux:
echo 1 > /proc/sys/net/ipv4/tcp_tw_recycle
from http://lartc.org/howto/lartc.kernel.obscure.html :
"Enable fast recycling TIME-WAIT sockets. Default value is 1. It should not be changed without advice/request of technical experts"
My expert advice: only use this on machines connected on a low-latency LAN. It *will* break internet-facing interfaces. It halves the constant used by the Nagle algorithm:
This is somewhat interesting. It suggests a potential problem which I hadn't thought about before. If you need to accept more than about 64k connections (not necessarily concurrent) in less than TIME-WAIT seconds, you might run out of ports. Anyone know what happens to new connection attempts to a server in this condition?
Alec, any idea if your server could be getting into this state every once in a while? This is an appealing hypothesis, since it wouldn't necessarily happen at peak connection time (but potentially shortly after a peak), would resolve itself given a short period of time, wouldn't necessarily prevent all new connection attempts, since old TIME-WAIT sockets would be gradually timing out (so your other low- volume servers might still appear to be working normally), wouldn't interfere with already established connections, and might not change the userspace-visible syscall behavior (depending on what Linux does in this case, but I wouldn't be surprised if connection failures due to this never showed up in an accept(2) result).
Jean-Paul
_______________________________________________ Twisted-Python mailing list Twisted-Python@twistedmatrix.com http://twistedmatrix.com/cgi-bin/mailman/listinfo/twisted-python
_______________________________________________ Twisted-Python mailing list Twisted-Python@twistedmatrix.com http://twistedmatrix.com/cgi-bin/mailman/listinfo/twisted-python -- http://www.alvinatorsplayground.blogspot.com/
On Mon, 15 Sep 2008 04:43:23 -0700, Alec Matusis <matusis@yahoo.com> wrote:
I wanted to post that I solved the problem, and it's unrelated to Twisted.
[snip]
I am sorry for the confusion, I was emboldened by finding the epoll bug back in April, but this time Twisted continues to rock, and it's just a stupid Ubuntu's tcp setting.
Thanks for the follow up and explaination. It's good to know what was going on here and what the fix was, even though it wasn't anything in Twisted. Jean-Paul
participants (6)
-
Alec Matusis
-
Alvin Delagon
-
Jean-Paul Calderone
-
Paul C. Nendick
-
Phil Mayers
-
Werner Thie