[Twisted-Python] carbon-relay eating CPU - EAGAIN (Resource temporarily unavailable)?

- Python 2.7.3
- [twisted, version 13.1.0]
- xen-domU
`atop` shows that `carbon-relay` is eating 80-90% USRCPU. From the `strace`:
```
accept(7, {sa_family=AF_INET, sin_port=htons(60649), sin_addr=inet_addr("192.237.222.81")}, [16]) = 257
accept(7, {sa_family=AF_INET, sin_port=htons(51564), sin_addr=inet_addr("166.78.1.48")}, [16]) = 257
accept(7, 0x7ffff4679550, [16]) = -1 EAGAIN (Resource temporarily unavailable)
accept(7, {sa_family=AF_INET, sin_port=htons(33654), sin_addr=inet_addr("198.61.194.248")}, [16]) = 257
accept(7, {sa_family=AF_INET, sin_port=htons(50037), sin_addr=inet_addr("166.78.181.204")}, [16]) = 257
accept(7, 0x7ffff4679550, [16]) = -1 EAGAIN (Resource temporarily unavailable)
```
The strange thing is: even after restarting the service, it seems stuck at fd 7 every time I run `strace`. Does that mean this fd is not being cleaned up properly?
I have increased the number of open files:
**/proc/2891/limits**
```
Limit                     Soft Limit    Hard Limit    Units
Max cpu time              unlimited     unlimited     seconds
Max file size             unlimited     unlimited     bytes
Max data size             unlimited     unlimited     bytes
Max stack size            8388608       unlimited     bytes
Max core file size        0             unlimited     bytes
Max resident set          unlimited     unlimited     bytes
Max processes             15834         15834         processes
Max open files            16384         16384         files
Max locked memory         65536         65536         bytes
Max address space         unlimited     unlimited     bytes
Max file locks            unlimited     unlimited     locks
Max pending signals       15834         15834         signals
Max msgqueue size         819200        819200        bytes
Max nice priority         0             0
Max realtime priority     0             0
Max realtime timeout      unlimited     unlimited     us
```
After that, CPU usage decreases to ~50%.
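(For reference, the effective limit can also be checked, and the soft limit raised, from inside a Python process with the standard `resource` module; this is just a minimal sketch, not something carbon does itself:)

```python
import resource

# The soft/hard values behind "Max open files" in /proc/<pid>/limits.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("Max open files: soft=%d hard=%d" % (soft, hard))

# An unprivileged process may raise its soft limit up to the hard limit.
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
```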
My problem looks similar to this [thread](http://twistedmatrix.com/pipermail/twisted-python/2008-September/018361.html), but since we have only a few sockets in the TIME_WAIT state, I don't think enabling `tw_recycle` would help. As for `tcp_syncookies`, I don't see any related message in the syslog.
This is what I get when trying to start `carbon-relay` in debug mode:
```
26/11/2013 02:22:14 :: [listener] MetricPickleReceiver connection with 50.56.249.127:48772 lost: Connection to the other side was lost in a non-clean fashion: Connection lost.
26/11/2013 02:22:14 :: [listener] MetricPickleReceiver connection with 198.101.241.101:50672 lost: Connection to the other side was lost in a non-clean fashion: Connection lost.
26/11/2013 02:22:14 :: [listener] MetricPickleReceiver connection with 166.78.2.167:43346 lost: Connection to the other side was lost in a non-clean fashion: Connection lost.
```
This is from `twisted`:
```python
class ConnectionLost(ConnectionClosed):
    """Connection to the other side was lost in a non-clean fashion"""

    def __str__(self):
        s = self.__doc__.strip().splitlines()[0]
        if self.args:
            s = '%s: %s' % (s, ' '.join(self.args))
        s = '%s.' % s
        return s
```
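That `__str__` is exactly where the text in the debug log above comes from; a quick illustrative sketch, assuming the public `twisted.internet.error` module is importable in your environment:

```python
from twisted.internet.error import ConnectionLost

# The first line of the docstring is the message; any args are appended after a colon.
print(str(ConnectionLost()))
# -> Connection to the other side was lost in a non-clean fashion.
print(str(ConnectionLost('Connection lost')))
# -> Connection to the other side was lost in a non-clean fashion: Connection lost.
```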
I have also tried to [debug with `gdb`](https://wiki.python.org/moin/DebuggingWithGdb), but `pystack` returns nothing.

On 26/11/13 04:05, Quan Tong Anh wrote:
- Python 2.7.3
- [twisted, version 13.1.0]
- xen-domU
I assume xen-domU means "Linux" as the guest OS; which kernel version, out of interest?
`atop` shows that `carbon-relay` is eating 80-90% USRCPU. From the
I don't know about anyone else, but I don't know what carbon-relay is. A quick google suggests it's part of that graphite monitoring/RRD replacement thing; which version are you running (Or: what reactor is it using)?
How many open files do you have?
`strace`:
Without timestamps it's difficult to interpret this unambiguously; that could be a microsecond or minute we're seeing ;o)
The strange thing is: even after restarting the service, it seems stuck at fd 7 every time I run `strace`. Does that mean this fd is not being cleaned up properly?
No. fd numbers are allocated starting from 0 and going upwards. The same process running the same startup code every time will tend (not always, since it's order-dependent) to use the same file descriptors for the same things, modulo close/open events.
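A tiny illustration of that behaviour (the exact numbers depend on what the process already has open):

```python
import socket

# fds 0-2 are stdin/stdout/stderr, so the first socket usually gets fd 3.
a = socket.socket()
b = socket.socket()
print(a.fileno(), b.fileno())   # typically 3 and 4

a.close()
c = socket.socket()
print(c.fileno())               # the lowest free descriptor (3) is reused
```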

On Tue, Nov 26, 2013 at 6:51 PM, Phil Mayers <p.mayers@imperial.ac.uk> wrote:
I assume xen-domU means "Linux" as the guest OS; which kernel version, out of interest.
3.2.0-53-virtual
I don't know about anyone else, but I don't know what carbon-relay is. A quick google suggests it's part of that graphite monitoring/RRD replacement thing; which version are you running (Or: what reactor is it using)?
Yes, it's here: https://graphite.readthedocs.org/en/latest/carbon-daemons.html#carbon-relay-...
The version I'm using is 0.9.12.
How many open files do you have?
1664.
`strace`:
Without timestamps it's difficult to interpret this unambiguously; that could be a microsecond or minute we're seeing ;o)
Here you go:
```
13:51:31 accept(7, {sa_family=AF_INET, sin_port=htons(57236), sin_addr=inet_addr("198.61.194.221")}, [16]) = 216
13:51:31 accept(7, {sa_family=AF_INET, sin_port=htons(39755), sin_addr=inet_addr("50.56.175.61")}, [16]) = 216
13:51:31 accept(7, 0x7fff2eebc270, [16]) = -1 EAGAIN (Resource temporarily unavailable)
13:51:31 accept(7, {sa_family=AF_INET, sin_port=htons(49236), sin_addr=inet_addr("198.101.238.239")}, [16]) = 216
13:51:31 accept(7, {sa_family=AF_INET, sin_port=htons(49639), sin_addr=inet_addr("166.78.2.103")}, [16]) = 216
13:51:31 accept(7, 0x7fff2eebc270, [16]) = -1 EAGAIN (Resource temporarily unavailable)
...
```
There are about 20 to 30 connections per second.

Hi Quan Tong Anh.
On Mon, Nov 25, 2013 at 10:05 PM, Quan Tong Anh <anhquankitty@gmail.com> wrote:
- Python 2.7.3
- [twisted, version 13.1.0]
- xen-domU
`atop` shows that `carbon-relay` is eating 80-90% USRCPU. From the `strace`:
```
accept(7, {sa_family=AF_INET, sin_port=htons(60649), sin_addr=inet_addr("192.237.222.81")}, [16]) = 257
accept(7, {sa_family=AF_INET, sin_port=htons(51564), sin_addr=inet_addr("166.78.1.48")}, [16]) = 257
accept(7, 0x7ffff4679550, [16]) = -1 EAGAIN (Resource temporarily unavailable)
accept(7, {sa_family=AF_INET, sin_port=htons(33654), sin_addr=inet_addr("198.61.194.248")}, [16]) = 257
accept(7, {sa_family=AF_INET, sin_port=htons(50037), sin_addr=inet_addr("166.78.181.204")}, [16]) = 257
accept(7, 0x7ffff4679550, [16]) = -1 EAGAIN (Resource temporarily unavailable)
```
The strange thing is: even after restarting the service, it seems stuck at fd 7 every time I run `strace`. Does that mean this fd is not being cleaned up properly?
You have given a lot of data, but haven't actually described a problem. Nothing is "stuck" at FD 7; it's simply the listening socket's file descriptor, and the way listening sockets work is that you call accept() on them repeatedly to get incoming connections.
accept() returning EAGAIN is normal behavior and should basically be ignored. Twisted handles this case appropriately. From the man page:
```
EAGAIN or EWOULDBLOCK
       The socket is marked nonblocking and no connections are present to
       be accepted.  POSIX.1-2001 allows either error to be returned for
       this case, and does not require these constants to have the same
       value, so a portable application should check for both
       possibilities.
```
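In other words, the EAGAIN lines in the strace are just the reactor draining the accept queue until it is empty, which is roughly what any non-blocking server does. A hand-rolled sketch (not Twisted's actual code; the port number is a placeholder):

```python
import errno
import socket

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
srv.bind(('0.0.0.0', 2004))   # placeholder port
srv.listen(50)
srv.setblocking(False)

def drain_accept_queue(srv):
    """Accept every pending connection, stopping on the expected EAGAIN."""
    while True:
        try:
            conn, addr = srv.accept()
        except socket.error as e:
            if e.errno in (errno.EAGAIN, errno.EWOULDBLOCK):
                return          # queue drained; nothing is wrong
            raise
        conn.close()            # a real server would hand conn off to a protocol
```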
Is there actually something wrong with your service? Can you explain the behavior you're seeing and how it differs from the behavior you want to see? If the problem is only "carbon uses a lot of CPU", maybe carbon is just inefficient, or you need to scale out to more instances?
participants (3)
- Christopher Armstrong
- Phil Mayers
- Quan Tong Anh