[Twisted-Python] debugging a memory leak
I modified a tolerably leaking (about 40MB/day) Twisted server: when the new code was pushed, the memory leak became catastrophic (about 100MB/hr). We could live with 40MB/day, but the new code needs to be debugged. First, I reverted to the old version, which leaks 40MB/day. (The leak rate is actually proportional to the number of new connections per second, which correlates with the CPU utilization of the process: if CPU as measured by top jumps above 90%, the leak can accelerate to 50MB/hr.) I took two steps to debug the leak:

1) Using guppy/heapy via the manhole:
hp = hpy()
h = hp.heap()
h
Partition of a set of 1157084 objects. Total size = 140911144 bytes.
 Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
     0 785798  68 48463680  34  48463680  34 str
     1   7357   1 24660664  18  73124344  52 dict of service.TagSession
     2  11735   1 12298280   9  85422624  61 dict of twisted.internet.base.DelayedCall
     3   7377   1  7731096   5  93153720  66 dict of twisted.internet.tcp.Server
     4   7375   1  7729000   5 100882720  72 dict of protocols.TagProtocol
     5  30925   3  7174600   5 108057320  77 __builtin__.set
     6   9193   1  6373336   5 114430656  81 dict (no owner)
     7  15557   1  3396904   2 117827560  84 list
     8  44833   4  3227976   2 121055536  86 types.BuiltinFunctionType
     9  38142   3  3051360   2 124106896  88 types.MethodType
<328 more rows. Type e.g. '_.more' to view.>
Note that the total size of all objects is 140911144 bytes. The items at indices 1, 3 and 4 in this list track the actual number of connected clients. I wait about 48 hrs, then execute the same command, and I see approximately the same total size, about 130MB. So the total size that Heapy sees via the manhole is stable, fluctuating around 140MB.

The problem is that the total RSS size of the process as seen by the OS is much larger, 871680KB = 851MB:

$ ps -o pid,vsz,rss,sz,size -p 11236
  PID     VSZ    RSS     SZ      SZ
11236 1303180 871680 325795 1174388

It is this total RSS size that keeps leaking at 40MB per day. As far as I understand, this means the problem is not in pure Python code, since Heapy shows that the total size of all Python objects is more or less constant. Is this a correct assumption?

2) So now I turn to valgrind. I am no expert at using valgrind, so what I did was based only on general logic and rough guesses. Since I cannot run this under valgrind on a production machine for performance reasons, I recompile Python on the staging machine:

./configure --enable-shared --without-pymalloc --prefix=/nail/encap/python-2.6.4-valgrind

I then follow the instructions in http://svn.python.org/projects/python/trunk/Misc/README.valgrind and run the twistd process like this:

valgrind --tool=memcheck \
  --suppressions=/nail/sys/src/Python-2.6.4/Misc/valgrind-python.supp \
  --leak-check=full --log-file=/tmp/valgrind.log \
  /usr/local/bin/twistd --no_save --reactor=epoll --pidfile=logs/tagserv.pid \
  --logfile=logs/tagserv.log --python=tagserv.py

The memory for the process shown by the OS is now 5x normal, and performance is about 5x worse, since it is running inside valgrind's synthetic CPU. Because this is done on the staging box, where I cannot accurately reproduce the real load, the memory leaks from simulated load seen by ps -o rss are pretty small, about 1 to 10MB. Still, I am interested in finding out what they are.

Now I run into the limits of my understanding of how to use valgrind. This may not be the appropriate list for this, but perhaps someone recognizes the problem. When I start the server, about 240 lines are written to the valgrind log file (--log-file=/tmp/valgrind.log). When I shut it down, it adds another 100 lines. No matter what I do in between, I always end up with a log file of exactly 343 lines. I can have server runs that leak 1MB or 10MB, but in the end I get pretty much the same log file. Moreover, when I grep for all lost-memory reports:

$ grep lost valgrind.log.1
17,352 bytes in 31 blocks are possibly lost in loss record 49 of 62
203,312 bytes in 478 blocks are possibly lost in loss record 57 of 62
   definitely lost: 0 bytes in 0 blocks.
   possibly lost: 220,664 bytes in 509 blocks.
64 bytes in 2 blocks are definitely lost in loss record 12 of 63
17,352 bytes in 31 blocks are possibly lost in loss record 50 of 63
203,824 bytes in 479 blocks are possibly lost in loss record 58 of 63
   definitely lost: 64 bytes in 2 blocks.
   possibly lost: 221,176 bytes in 510 blocks.
47 bytes in 1 blocks are definitely lost in loss record 8 of 63
128 bytes in 4 blocks are definitely lost in loss record 16 of 63
584 (104 direct, 480 indirect) bytes in 2 blocks are definitely lost in loss record 26 of 63
1,008 bytes in 6 blocks are definitely lost in loss record 31 of 63
22,296 bytes in 41 blocks are possibly lost in loss record 50 of 63
183,368 bytes in 381 blocks are possibly lost in loss record 59 of 63
   definitely lost: 1,287 bytes in 13 blocks.
   indirectly lost: 480 bytes in 20 blocks.
   possibly lost: 205,664 bytes in 422 blocks.

If I add up all those numbers, I get less than 1MB. How do I track down the 10MB leak? Are there any alternative strategies for finding this leak?
Hi Alec,

...and they promised you that with a gc'ed language there would never be a memory problem again, you could just plain forget about it.

I was stuck in the same position as you, and after lots of probing the following approach helped a lot to correct what later proved to be overly optimistic coding: holding on to objects in other objects for performance/practical reasons. Producing uncollectable cycles in Twisted is about as easy as forgetting about memory when you have Alzheimer's.

Proving and working on the problem was only possible on the production machine under a real load. I went ahead and created a manhole service on the production server, allowing me to peek at the Python object space without disturbing it too much. What I used as a tool is the code included below.

After cleaning out all the self-produced cycles, our server processes stabilized at roughly 280 to 320 MB per process and have now been running stable for months, with more than 20k logins per day and an average connect time per user of about 25 minutes, playing games delivered by nevow/athena LivePages.

As I said before, all cycles I found in our software were introduced by patterns like

def beforeRender(self, ctx):
    self.session = inevow.ISession(ctx)

The included code helps to identify the number of objects hanging around. Although it's a primitive tool, it shines the light where it's needed, and if you see certain object counts run away, you have at least identified the neighborhood where the uncollectable cycles are built.

Why didn't I use heapy/guppy and find it that way? I wasn't able to find the evidence for what I was suspecting with all the tools I tried (and boy, I tried for WEEKS). Avid users of heapy will most probably disagree and tell me it would have been easy. But in a situation like this, everything that moves you out of the pothole you're in is the right thing to do.
HTH, Werner

import gc
# dumpobj is an external helper module providing dumpObj (used on the last line)

exc = [ "function", "type", "list", "dict", "tuple", "wrapper_descriptor",
        "module", "method_descriptor", "member_descriptor", "instancemethod",
        "builtin_function_or_method", "frame", "classmethod",
        "classmethod_descriptor", "_Environ", "MemoryError", "_Printer",
        "_Helper", "getset_descriptor", "weakref" ]

inc = [ 'myFirstSuspect', 'mySecondSuspect' ]

prev = {}

def dumpObjects(delta=True, limit=0, include=inc, exclude=[]):
    global prev
    if include != [] and exclude != []:
        print 'cannot use include and exclude at the same time'
        return
    print 'working with:'
    print '   delta: ', delta
    print '   limit: ', limit
    print ' include: ', include
    print ' exclude: ', exclude
    objects = {}
    gc.collect()
    oo = gc.get_objects()
    for o in oo:
        if getattr(o, "__class__", None):
            name = o.__class__.__name__
            if ((exclude == [] and include == []) or
                (exclude != [] and name not in exclude) or
                (include != [] and name in include)):
                objects[name] = objects.get(name, 0) + 1
    names = objects.keys()
    names.sort()
    for name in names:
        if limit == 0 or objects[name] > limit:
            if not prev.has_key(name):
                prev[name] = objects[name]
            dt = objects[name] - prev[name]
            if delta or dt != 0:
                print '%0.6d -- %0.6d -- ' % (dt, objects[name]), name
                prev[name] = objects[name]

def getObjects(oname):
    """
    gets a list of all the objects with the given class name out of the
    sea of gc'ed objects
    """
    olist = []
    gc.collect()
    oo = gc.get_objects()
    for o in oo:
        if getattr(o, "__class__", None):
            name = o.__class__.__name__
            if name == oname:
                olist.append(o)
    return olist

dumpObject = dumpobj.dumpObj
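A minimal usage sketch (not part of Werner's original mail), assuming the snippet above is saved as a module importable from the manhole; the module name leaktools and the class name TagSession (from Alec's heapy output) are used here only for illustration:

import leaktools   # hypothetical module holding Werner's helpers above

# First call with delta=False: prints only its header and records the
# baseline count per class name in prev.
leaktools.dumpObjects(delta=False, include=[], exclude=leaktools.exc)

# ... let the server run under load for a while, then call again: now only
# the classes whose counts changed since the baseline are printed.
leaktools.dumpObjects(delta=False, include=[], exclude=leaktools.exc)

# Pull live instances of a suspect class out of the collector for inspection.
sessions = leaktools.getObjects('TagSession')
print len(sessions)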
Thank you Werner! I am playing with your object counter. But before I can interpret the results, I have to ask a possibly stupid question. Can someone explain this behavior of Python 2.6 on x86 Linux 2.6.24 to me?

1) Start the interpreter and check the size of the interpreter process:

$ ps -orss -p24316
  RSS
 4212

The size is 4MB, fine.

2) Allocate a huge chunk of memory:

x = range(1, 10000000)

Check the size of the interpreter process:

$ ps -orss -p24316
  RSS
322460

The RSS grew to 322460KB, about 315MB, as I expected.

3) Delete this object:

del x

Check the size of the interpreter process:

$ ps -orss -p24316
  RSS
244332

So the memory is far off its original value! It's still up by nearly 240MB, and it never gets released. Why is that?

4)

import gc
gc.collect()
0

$ ps -orss -p24316
  RSS
244404

Not much luck here!
Repeat steps 2 to 4: no more memory is used, Python does not leak any further. Now repeat the steps, but allocate the memory like this instead:

x = [0] * 10000000

This time the memory is released. The problem is in the virtual machine's handling of integers. Excuse my poor explanation, my English is very bad.

Regards,
Javi
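My reading of Javi's point (an addition, not from the original mail): range(1, 10000000) creates millions of distinct int objects, and CPython 2.x keeps freed ints on an internal free list whose blocks are essentially never handed back to the OS, whereas [0] * 10000000 is one large array of references to a single int, and that single large allocation is released on del. A minimal sketch, assuming CPython 2.6 on Linux; rss_kb() is a hypothetical helper and the exact numbers will vary:

import gc

def rss_kb():
    # Linux-only helper: read this process's resident set size from /proc
    for line in open('/proc/self/status'):
        if line.startswith('VmRSS:'):
            return int(line.split()[1])

print 'baseline    ', rss_kb()

x = range(1, 10000000)        # ~10M separate PyIntObjects plus the list
print 'after range ', rss_kb()
del x
gc.collect()
print 'after del x ', rss_kb()   # stays high: the ints sit on the int free list

y = [0] * 10000000            # one big list of references to a single int
print 'after [0]*N ', rss_kb()
del y
gc.collect()
print 'after del y ', rss_kb()   # drops back: the large list buffer is freed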
Werner, I am using your code, and it shows essentially the same thing as Heapy: the counts of all common objects more or less agree.

When I start the process, both the Python object sizes and their counts rise proportionally to the number of reconnected clients, and then they stabilize after all clients have reconnected. At that moment, the "external" RSS size of the process is about 260MB, while the "internal" size of all Python objects reported by Heapy is about 150MB. After two days, the internal sizes/counts stay the same, but the external size grows to 1500MB.

Python object counts/total sizes are measured from the manhole. Is this sufficient to conclude that this is a C memory leak in one of the external modules or in the Python interpreter itself?
In general, there are other reasons why heap size and RSS size do not match:

1. pages are empty but not returned to the OS
2. pages cannot be returned to the OS because they are not completely empty

It seems Python has different allocators for small and large objects:
http://www.mail-archive.com/python-list@python.org/msg256116.html
http://effbot.org/pyfaq/why-doesnt-python-release-the-memory-when-i-delete-a-large-object.htm

Assuming Python uses malloc for all its allocations (does it?), it is the malloc implementation that determines whether empty pages are returned to the OS. Under Linux with glibc (your system?), empty pages are returned, so reason 1 does not apply. Depending on the allocation behaviour of Python, the pages may not be empty, though, so reason 2 is a likely suspect.

Python extensions written in C could also leak or fragment memory. Are you using any extensions that are not pure Python?

Bye,
Maarten
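An illustrative sketch of Maarten's reason 2 (my addition, not from the thread), assuming CPython 2.x with pymalloc on Linux: when many small objects are allocated and only some of them are later freed, almost every pymalloc arena and malloc'd page still holds a few survivors, so very little memory can be handed back even though the heap is half empty.

import gc

# Fill memory with a couple of million small, separately allocated objects.
objs = [str(i) * 10 for i in xrange(2000000)]
# (check RSS here with ps -orss: it has grown by hundreds of MB)

# Free every second object; half the heap is now logically unused.
del objs[::2]
gc.collect()   # not strictly needed (strings die by refcounting), included for clarity
# (check RSS again: it barely drops, because nearly every arena/page still
#  contains some of the surviving strings and so cannot be released)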
Hi Maarten,

Your link http://effbot.org/pyfaq/why-doesnt-python-release-the-memory-when-i-delete-a-large-object.htm seems to suggest that even though the interpreter does not release memory back to the OS, it can be re-used by the interpreter. If this were our problem, I'd expect the memory to be capped by the highest usage, as opposed to constantly leaking: in my case the load is virtually constant, but the memory still grows over time.

The environment is Linux 2.6.24 x86-64; the extensions used are MySQLdb and pyCrypto (latest stable releases of both).
Hi,

Assuming that memory not released to the OS can be reused by the interpreter thanks to its suballocation system, you would expect the overall memory usage to level out over time; that's what I observe with our processes (sitting at several hundred MB per process). We are using external C libraries which do lots of malloc/free, and one of the bigger sources of pain is indeed bringing such a library to the point where it is clean, not only by freeing all allocated memory under every circumstance but also Python-refcounting-wise. I usually go through all the motions of building a complete debug chain for all modules involved in a project and writing a test bed to prove a clean and proper implementation.

So if you're using C/C++ based modules in your project, I would mark them as highly suspicious of being responsible for leaks until proven otherwise.

Not to bother you with numbers, but I usually allocate about 30% of overall project time to bringing a server into a production-ready state, meaning uptimes of months/years, no fishy feelings, no performance oscillations, predictable caving in and recuperating when overloaded: all the things you have to tick off to sign a project off as completed, meaning you don't have to do daily 'tire kicking' maintenance and periodic reboots.

Werner
In desperation at not finding the real memory leak on the production server, I wrote a test server that I can push to arbitrarily high RSS memory. I am far from sure whether this is the same leak that I observe in production, but I would like to understand what this one is. This is the server code:

Server.py:

import twisted.protocols.basic
from twisted.internet.protocol import Factory
from twisted.internet import reactor

class HiRate(twisted.protocols.basic.LineOnlyReceiver):
    MAX_LENGTH = 20000

    def lineReceived(self, line):
        if line == 'get':
            out = 'a' * 4000 + '\r\r'
            self.transport.write(out)

factory = Factory()
factory.protocol = HiRate

reactor.listenTCP(8007, factory, backlog=50, interface='10.18.0.2')
reactor.run()

This server has to be flooded by "get" requests from this client:

Client.py:

import socket, time

HOST = '10.18.0.2'
PORT = 8007

def client():
    """high rate client, needs a dedicated CPU to run"""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.connect((HOST, PORT))
    except socket.error, e:
        print 'client error is %s' % e
        return
    n = 0
    while 1:
        #print "iter %s" % n
        #time.sleep(.001)
        s.send('get\r\n')
        s.setblocking(0)
        try:
            r = s.recv(1024)
        except:
            r = 0
        while r:
            #print r
            try:
                r = s.recv(1024)
            except socket.error, e:
                r = 0
        s.setblocking(1)
        n += 1

client()

To reproduce the memory leak, I either need two machines with a fast LAN between them (since the client program takes 100% CPU), or possibly one machine with a dual-core CPU (I have not tried that). It is important that client.py is given a separate CPU to run on.

When the length of the response from the server is sufficient (out = 'a'*4000+'\r\r'; 4000 is enough in my case), the RSS of the server process starts to grow without bound. If you introduce a small delay in the client (#time.sleep(.001)), the leak does not occur.

Looking at tcpdump on the server machine, I sometimes see many "get" packets from the client in a row that are not followed by response packets from the server with payload 'aaaaa...'. Only when the server is in this "overwhelmed" state does the memory seem to grow unbounded. I first thought it might be an issue of an unbounded send queue on the server, but examining Send-Q with netstat shows that Send-Q saturates at a certain ceiling value, while the RSS of the server process continues to grow.

Here are some commands I was using to watch the parameters of the server:

Watch Send-Q and Recv-Q:
root$ watch -n1 netstat -an

RSS memory of the server:
root$ watch -n1 ps -orss -p`netstat -nlp | grep :8007 | awk '{print $7}' | cut -d/ -f1`

Traffic to/from the server (in my case eth1 is the LAN to the client):
root$ tcpdump -A -s10024 -nn -i eth1 'port 8007'
Hi Alec,

I wouldn't call that a memory leak. In your example (SC/s = (n * CS)/s with n > 1, where SC is bytes sent to the client and CS is bytes sent to the server), there is some CS that satisfies the condition SC/s > [available bandwidth]. For all CS that exceed that limit, the data will first be queued in the Send-Q of the socket and then inside Twisted, because the Send-Q is full. That's the reason you see increasing memory usage, but it's not a leak; it is a server that cannot cope with this kind of DOS attack. Twisted cannot handle this situation for you, because that would mean it had to decide which packets, connections, etc. to drop. Such decisions have to be made by your application.

regards,
Johann
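To put rough numbers on Johann's point (my arithmetic, not from the thread): each 5-byte 'get\r\n' triggers a 4002-byte reply, an amplification of roughly 800x. A client pushing, say, 10MB/s of requests is therefore asking the server to send back on the order of 8GB/s; everything that cannot fit into the socket's Send-Q accumulates in Twisted's userspace write buffer, which is exactly the unbounded RSS growth observed.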
You can get notifications about when your send buffers are full by registering a producer with the transport. http://twistedmatrix.com/documents/current/core/howto/producers.html describes the APIs involved. This will help you make these decisions in your application.

Jean-Paul
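A minimal sketch of what Jean-Paul suggests (my illustration based on the producers howto, applied to the HiRate test server above, not code from the thread): the protocol registers itself as a streaming push producer, so the transport calls pauseProducing() when its write buffer fills and resumeProducing() when it drains, and the application decides what to do while paused (here it simply stops answering).

from zope.interface import implements
from twisted.internet import interfaces, reactor
from twisted.internet.protocol import Factory
import twisted.protocols.basic

class HiRate(twisted.protocols.basic.LineOnlyReceiver):
    implements(interfaces.IPushProducer)
    MAX_LENGTH = 20000

    def connectionMade(self):
        self.paused = False
        # streaming=True: we keep producing until the transport tells us to pause
        self.transport.registerProducer(self, True)

    def pauseProducing(self):
        # The transport's write buffer is full; stop answering until resumed
        # (a real application might instead drop data or close the connection).
        self.paused = True

    def resumeProducing(self):
        self.paused = False

    def stopProducing(self):
        self.paused = True

    def lineReceived(self, line):
        if line == 'get' and not self.paused:
            self.transport.write('a' * 4000 + '\r\r')

factory = Factory()
factory.protocol = HiRate
reactor.listenTCP(8007, factory, backlog=50, interface='10.18.0.2')
reactor.run()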
It turns out that the situation described in my last email, where the server buffers strings because the socket Send-Q is saturated, never arises in our production system; it was an artifact of high-load testing in the lab environment.

I "fixed" the memory leak, and I still do not understand what caused it. The code looked like this:

File protocols.py:

class MyProtocol(twisted.protocols.basic.LineOnlyReceiver):

    def lineReceived(self, line):
        def login_cb(session):
            self.session = session
        # this deferred chain creates an instance of Session in the end
        d = self.factory.service.login_deferred(Callback(self), line)
        d.addErrback(twisted.python.log.err)
        # adds some properties to the session instance
        d.addCallback(self.factory.service.authenticate)
        d.addErrback(twisted.python.log.err)
        d.addCallback(login_cb)
        d.addErrback(twisted.python.log.err)

class Callback:
    def __init__(self, protocol):
        self.protocol = protocol

File service.py (protocol.factory.service):

class Session:
    def __init__(self, service, callback):
        self.callback = callback
        self.service = service

    def do_something_asynchronous(self):
        # this function creates the MEMORY LEAK
        def cb(result):
            self.some_property = result
            return self
        d = some_asynch_operation()
        d.addCallback(cb)
        d.addErrback(twisted.python.log.err)

class Service:
    def login_deferred(self, callback, line):
        def cb(result):
            s = Session(self, callback)
            return s
        d = asynch_operation(line)
        d.addCallback(cb)
        d.addErrback(twisted.python.log.err)
        return d

    def authenticate(self, session):
        # the deferred below caused the MEMORY LEAK; when its code was moved
        # in here from Session.do_something_asynchronous, the leak disappeared
        d = session.do_something_asynchronous()
        return d

The problem was in the function Session.do_something_asynchronous, which returned the instance of its own class. Interestingly, even though this was causing an unbounded memory leak visible from the OS, the number of objects tracked by the garbage collector, len(gc.get_objects()), remained stable, and none of the objects that have a len() were getting too long. Nothing could be seen by Heapy either. When I removed this function and moved its code into Service.authenticate, the memory stabilized. I still do not quite understand why; I was just acting on the intuition that when a function returns its class instance, it could create a reference cycle.

After the removal of this function, the server exhibits the following behavior: when it starts, after all clients connect and it reaches its average load, the memory stays at 350MB for about 1.5hrs. Then it suddenly jumps to 650MB and stays there. Since at 350MB the server already serves the maximum number of connections (and is operating at its normal load), I think the jump to 650MB is caused by some additional memory leak.

Could there be anything wrong with this code pattern, where I launch deferreds from other deferreds (e.g. my function Service.login_deferred, which is itself part of a deferred chain, launches another deferred with its own callback cb() inside)? Also, this code has the following reference cycle: protocol.session -> session.callback.protocol. This cycle has existed in the code for a long time and never caused unbounded memory growth, even though the server creates on the order of 100 Sessions per second, and they always seem to be garbage-collected (judging by gc.get_objects()), even with Python 2.4, before the improvements to the cyclic garbage collector.
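A possible next step for the remaining jump from 350MB to 650MB (my suggestion, not from the thread): since Werner's getObjects() helper earlier in the thread can pull live instances out of the collector, gc.get_referrers() can then show what is keeping them alive. A rough sketch to run from the manhole:

import gc

def show_referrers(obj, max_items=10):
    """Print a short summary of the objects that refer to obj."""
    refs = gc.get_referrers(obj)
    print '%d referrers' % len(refs)
    for ref in refs[:max_items]:
        # frames and the temporary list used by this function itself will
        # show up here too; ignore those entries in the output
        print type(ref), repr(ref)[:100]

# e.g., using the helper posted earlier in the thread:
# sessions = getObjects('Session')
# show_referrers(sessions[0])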
participants (6):
- Alec Matusis
- exarkun@twistedmatrix.com
- Johann Borck
- lasizoillo
- Maarten ter Huurne
- Werner Thie