How does ulimit -a compare between old and new machine?
ulimit is only higher on the new machine, I upped it. Also, from my experience, running into ulimits produces log messages. alecm@serv2 ~> cat /proc/23678/limits Limit Soft Limit Hard Limit Units Max cpu time unlimited unlimited ms Max file size unlimited unlimited bytes Max data size unlimited unlimited bytes Max stack size 8388608 unlimited bytes Max core file size 0 unlimited bytes Max resident set 1073741824 1073741824 bytes Max processes 126976 126976 processes Max open files 25000 25000 files Max locked memory 32768 32768 bytes Max address space unlimited unlimited bytes Max file locks unlimited unlimited locks Max pending signals 126976 126976 signals Max msgqueue size 819200 819200 bytes Max nice priority 0 0 Max realtime priority 0 0
If you need to accept more than about 64k connections (not necessarily concurrent) in less than TIME-WAIT seconds, you might run out of ports.
How does netstat look like?
I have to wait for another outage tonight, to get more definite results on this. So far, I have had only one very brief outage at 7:46am I was asleep, but I have some new data logged. I have two servers, one on port 5222 (problematic one, let's call it serv1 ), and one on 5228 (that has no problems, call it serv2 ). I made a script that measures the number of connections on each type of server respectively like this: while [ 1 ] do sleep 5 netstat -ant | grep ESTABLISHED | awk '/:5228 /{a += 1}/:5222 /{b += 1} END {print strftime(" %d %b %Y %H:%M:%S", systime( ))" serv2: "a" serv1: "b" total: "a+b }' done This is what happened around 7:46am (lines that have "+" in front indicate anomaly): 11 Sep 2008 07:41:29 serv2: 2756 serv1: 1064 total: 3820 11 Sep 2008 07:41:36 serv2: 2764 serv1: 1062 total: 3826 11 Sep 2008 07:41:44 serv2: 2784 serv1: 1049 total: 3833 11 Sep 2008 07:41:51 serv2: 2756 serv1: 1055 total: 3811 11 Sep 2008 07:41:59 serv2: 2760 serv1: 1066 total: 3826 11 Sep 2008 07:42:06 serv2: 2777 serv1: 1050 total: 3827 11 Sep 2008 07:42:14 serv2: 2769 serv1: 1054 total: 3823 11 Sep 2008 07:42:21 serv2: 2751 serv1: 1055 total: 3806 11 Sep 2008 07:42:29 serv2: 2760 serv1: 1050 total: 3810 11 Sep 2008 07:42:36 serv2: 2747 serv1: 1040 total: 3787 11 Sep 2008 07:42:44 serv2: 2737 serv1: 1046 total: 3783 11 Sep 2008 07:42:51 serv2: 2739 serv1: 1046 total: 3785 11 Sep 2008 07:42:59 serv2: 2743 serv1: 1037 total: 3780 11 Sep 2008 07:43:07 serv2: 2720 serv1: 1041 total: 3761 11 Sep 2008 07:43:14 serv2: 2714 serv1: 1047 total: 3761 11 Sep 2008 07:43:22 serv2: 2721 serv1: 1045 total: 3766 11 Sep 2008 07:43:29 serv2: 2697 serv1: 1056 total: 3753 11 Sep 2008 07:43:37 serv2: 2710 serv1: 1059 total: 3769 +11 Sep 2008 07:43:44 serv2: 2765 serv1: 1304 total: 4069 +11 Sep 2008 07:43:53 serv2: 2854 serv1: 1904 total: 4758 +11 Sep 2008 07:44:01 serv2: 2714 serv1: 2190 total: 4904 +11 Sep 2008 07:44:09 serv2: 2715 serv1: 2202 total: 4917 +11 Sep 2008 07:44:17 serv2: 2779 serv1: 1891 total: 4670 +11 Sep 2008 07:44:26 serv2: 2812 serv1: 2284 total: 5096 +11 Sep 2008 07:44:36 serv2: 2828 serv1: 3496 total: 6324 +11 Sep 2008 07:44:46 serv2: 2715 serv1: 4327 total: 7042 +11 Sep 2008 07:44:56 serv2: 2638 serv1: 3499 total: 6137 +11 Sep 2008 07:45:05 serv2: 2714 serv1: 2396 total: 5110 +11 Sep 2008 07:45:15 serv2: 2776 serv1: 1464 total: 4240 +11 Sep 2008 07:45:25 serv2: 2728 serv1: 1604 total: 4332 +11 Sep 2008 07:45:35 serv2: 2708 serv1: 1566 total: 4274 +11 Sep 2008 07:45:45 serv2: 2750 serv1: 1680 total: 4430 +11 Sep 2008 07:45:54 serv2: 2755 serv1: 1311 total: 4066 +11 Sep 2008 07:46:04 serv2: 2704 serv1: 1178 total: 3882 11 Sep 2008 07:46:13 serv2: 2644 serv1: 1024 total: 3668 11 Sep 2008 07:46:22 serv2: 2573 serv1: 981 total: 3554 11 Sep 2008 07:46:30 serv2: 2739 serv1: 1051 total: 3790 11 Sep 2008 07:46:39 serv2: 2773 serv1: 1043 total: 3816 11 Sep 2008 07:46:46 serv2: 2548 serv1: 959 total: 3507 11 Sep 2008 07:46:54 serv2: 2773 serv1: 1044 total: 3817 11 Sep 2008 07:47:02 serv2: 2745 serv1: 1027 total: 3772 11 Sep 2008 07:47:10 serv2: 2591 serv1: 960 total: 3551 11 Sep 2008 07:47:17 serv2: 2721 serv1: 1002 total: 3723 11 Sep 2008 07:47:25 serv2: 2753 serv1: 1030 total: 3783 11 Sep 2008 07:47:32 serv2: 2773 serv1: 1029 total: 3802 11 Sep 2008 07:47:40 serv2: 2768 serv1: 1025 total: 3793 11 Sep 2008 07:47:47 serv2: 2764 serv1: 1021 total: 3785 11 Sep 2008 07:47:55 serv2: 2750 serv1: 1021 total: 3771 Each server also reports the maximum numbers of authenticated connected clients to the database. Even though serv1 peaked at 07:44:46 at 4327 "ESTABLISHED" connections, the number of normally authenticated clients in the database for this time period is only 1100, which does not reflect this weird connections spike. Since I was asleep at 7:46am, I did not have a chance to investigate what those connections actually were.
Anyone know what happens to new connection attempts to a server in this condition?
One thing that I noticed yesterday, that when server was in this condition, telnet simply hang, as if no ACK packet was received back: alecm@serv2 ~> telnet localhost 5222 Trying 127.0.0.1... and nothing This behavior or telnet seems to be unique to the case when ACK is not received in the handshake, I verified that with tcpdump. To contrast that, if nothing listens on the destination port for example, telnet gives: alecm@serv2 ~> telnet localhost 50007 Trying 127.0.0.1... telnet: connect to address 127.0.0.1: Connection refused telnet: Unable to connect to remote host: Connection refused which corresponds to RST packet that kernel returns in that case. I tried to write a 5 line mucked up server with accept() not returning any packets, and failed.
-----Original Message----- From: twisted-python-bounces@twistedmatrix.com [mailto:twisted-python- bounces@twistedmatrix.com] On Behalf Of Jean-Paul Calderone Sent: Thursday, September 11, 2008 6:47 AM To: Twisted general discussion Subject: Re: [Twisted-Python] intermittent problem: not accepting new connections
On Thu, 11 Sep 2008 14:31:38 +0100, "Paul C. Nendick" <paul.nendick@gmail.com> wrote:
Not necessarily related to what you've described, but I'll share something that's helped a good deal on my most-heavily hit twisted servers. Presuming you're using Linux:
echo 1 > /proc/sys/net/ipv4/tcp_tw_recycle
from http://lartc.org/howto/lartc.kernel.obscure.html :
"Enable fast recycling TIME-WAIT sockets. Default value is 1. It should not be changed without advice/request of technical experts"
My expert advice: only use this on machines connected on a low-latency LAN. It *will* break internet-facing interfaces. It halves the constant used by the Nagle algorithm:
This is somewhat interesting. It suggests a potential problem which I hadn't thought about before. If you need to accept more than about 64k connections (not necessarily concurrent) in less than TIME-WAIT seconds, you might run out of ports. Anyone know what happens to new connection attempts to a server in this condition?
Alec, any idea if your server could be getting into this state every once in a while? This is an appealing hypothesis, since it wouldn't necessarily happen at peak connection time (but potentially shortly after a peak), would resolve itself given a short period of time, wouldn't necessarily prevent all new connection attempts, since old TIME-WAIT sockets would be gradually timing out (so your other low- volume servers might still appear to be working normally), wouldn't interfere with already established connections, and might not change the userspace-visible syscall behavior (depending on what Linux does in this case, but I wouldn't be surprised if connection failures due to this never showed up in an accept(2) result).
Jean-Paul
_______________________________________________ Twisted-Python mailing list Twisted-Python@twistedmatrix.com http://twistedmatrix.com/cgi-bin/mailman/listinfo/twisted-python