[Twisted-Python] finding out what's blocking the reactor

Hi All, On some busy instances of the app I'm working with/on, we see websocket connections being dropped due to not ping/ponging in time, I'm fairly sure this is as a result of some user-written callbacks blocking the reactor for long periods of time (seconds...). What's the best way to debug a twisted application to find out what's blocking the reactor? cheers, Chris

I don’t know about “the best”, but I’ve had some success with https://vmprof.readthedocs.io/en/latest and https://github.com/ggreer/twisted_hang depending on what you’re looking for. Let us know if these work out! -g

On Monday, 13 May 2019 15:26:01 BST Chris Withers wrote:
Is the process CPU bound or IO bound? If CPU bound then cprofile or the tool Glyph mentioned would help. This advice is from a Linux perspective. If its IO bound the you may be waiting on a blocking operation. You could use strace on the process to see what its doing. As its hung for seconds you could connect gdb to the process and see a python backtrace using the py-bt command. You are likely to catch the offending code. The normal bt command will show you what the last python command it calling. You could adding logging to your code to show the time of callback entry and exit. If all else fails patch the twisted code to time callbacks and report the slow ones. Barry

Hi all with periods of seemingly no activity the gc comes to mind. Is the working set size stable or ever increasing, swap operations might block activity of the process I had problems in the beginning with twisted, because I was holding on to a lot of objects and thus creating circular non collectible object references leading to ever increasing work sets over time, with the gc hitting me hard once in a while. First check I do always is keep an eye on the size of the process at constant network load, size shrinking when network load is dwindling and increasing when network load is growing. If it is possible to manipulate network load precisely then returning to the same load should give you the same process size after some time repeatedly. As a reference I have twisted processes running with up times measured in years on FreeBSD. Werner On 5/14/19 05:39, Scott, Barry wrote:

On Wednesday, 15 May 2019 07:30:54 CEST Chris Withers wrote:
A simple way is to call gc.disable() and see if the problem goes away. Note that gc.disable() only disables the mark-and-sweep part of garbage collection, which is the part that can cause the Python interpreter to freeze. It doesn't disable garbage collection when the refcount of an object reaches zero. Bye, Maarten

On Wednesday, 15 May 2019 13:11:29 CEST Scott, Barry wrote:
In my experience gc is low level noise only. Our work load is very heavy and its never been an issue.
It depends entirely on how many objects you have. With a few GBs worth of relatively small objects, a full garbage collection cycle can take over a minute. But in normal circumstances, the overhead indeed isn't relevant. Bye, Maarten

Aloha the code below has proven to be a life saver together with manhole. It might be a bit outdated but I usually create a manhole access allowing me to inspect the server from within then using dumpObjects to look at the sea of objects in the running process. Even only watching the inner workings of a server is quite fun and illuminating. Werner exc =[ "function", "type", "list", "dict", "tuple", "wrapper_descriptor", "module", "method_descriptor", "member_descriptor", "instancemethod", "builtin_function_or_method", "frame", "classmethod", "classmethod_descriptor", "_Environ", "MemoryError", "_Printer", "_Helper", "getset_descriptor", "weakref" ] inc =[ ] prev ={} #delta - print delta between current and previous run #limit - only print object count bigger than limit #inc - include named objects in the dump #exc - exclude named obects in the dump defdumpObjects(delta=True, limit=0, include=inc, exclude=[]): globalprev ifinclude !=[] andexclude !=[]: print'cannot use include and exclude at the same time' return print'working with:' print' delta: ', delta print' limit: ', limit print' include: ', include print' exclude: ', exclude objects ={} gc.collect() oo =gc.get_objects() foro inoo: ifgetattr(o, "__class__", None): name =o.__class__.__name__ if((exclude ==[] andinclude ==[]) or\ (exclude !=[] andname notinexclude) or\ (include !=[] andname ininclude)): objects[name] =objects.get(name, 0) +1 ## if more: ## print o pk =prev.keys() pk.sort() names =objects.keys() names.sort() forname innames: iflimit ==0orobjects[name] >limit: ifnotprev.has_key(name): prev[name] =objects[name] dt =objects[name] -prev[name] ifdelta ordt !=0: print'%0.6d-- %0.6d-- '%(dt, objects[name]), name prev[name] =objects[name] defgetObjects(oname): """ gets an object list with all the named objects out of the sea of gc'ed objects """ olist =[] objects ={} gc.collect() oo =gc.get_objects() foro inoo: ifgetattr(o, "__class__", None): name =o.__class__.__name__ if(name ==oname): olist.append(o) returnolist On 5/15/19 02:19, Maarten ter Huurne wrote:

On 14/05/2019 16:39, Scott, Barry wrote:
Yes ;-)
Good call on strace, will do so next time one of these happens.
As its hung for seconds you could connect gdb to the process and see a python backtrace using the py-bt command. You are likely to catch the offending code.
I'm failing miserably to get gdb to find debug symbols for python. I've installed the system python's debug symbols rpm, I've installed gdb .py that ships in the python source tree (so at least I *have* a py-bt command now!) - what am I likely to be missing?
You could adding logging to your code to show the time of callback entry and exit.
Right, but if the callback yields, that will be falsely shown as blocking, no? cheers, Chris

On Wednesday, 15 May 2019 06:27:30 BST Chris Withers wrote:
At least on Fedora/Centos when you run up gdb you get a message that tells you the the command to run to install missing symbols. This is what I see on Fedora 30 for example: $ gdb python3 GNU gdb (GDB) Fedora 8.2.91.20190424-24.fc30 ... skip ... For help, type "help". Type "apropos word" to search for commands related to "word"... Reading symbols from python3... (No debugging symbols found in python3) Missing separate debuginfos, use: dnf debuginfo-install python3-3.7.3-1.fc30.x86_64 (gdb) Then I quit gdb and run the sudo dnf as suggested. Then next time you run gdb it will suggest even mode debug info to install. You can often skip that, you have all you need to see python code.
I'm not used to using yield with twisted. But it will not block.
cheers,
Chris

I don’t know about “the best”, but I’ve had some success with https://vmprof.readthedocs.io/en/latest and https://github.com/ggreer/twisted_hang depending on what you’re looking for. Let us know if these work out! -g

On Monday, 13 May 2019 15:26:01 BST Chris Withers wrote:
Is the process CPU bound or IO bound? If CPU bound then cprofile or the tool Glyph mentioned would help. This advice is from a Linux perspective. If its IO bound the you may be waiting on a blocking operation. You could use strace on the process to see what its doing. As its hung for seconds you could connect gdb to the process and see a python backtrace using the py-bt command. You are likely to catch the offending code. The normal bt command will show you what the last python command it calling. You could adding logging to your code to show the time of callback entry and exit. If all else fails patch the twisted code to time callbacks and report the slow ones. Barry

Hi all with periods of seemingly no activity the gc comes to mind. Is the working set size stable or ever increasing, swap operations might block activity of the process I had problems in the beginning with twisted, because I was holding on to a lot of objects and thus creating circular non collectible object references leading to ever increasing work sets over time, with the gc hitting me hard once in a while. First check I do always is keep an eye on the size of the process at constant network load, size shrinking when network load is dwindling and increasing when network load is growing. If it is possible to manipulate network load precisely then returning to the same load should give you the same process size after some time repeatedly. As a reference I have twisted processes running with up times measured in years on FreeBSD. Werner On 5/14/19 05:39, Scott, Barry wrote:

On Wednesday, 15 May 2019 07:30:54 CEST Chris Withers wrote:
A simple way is to call gc.disable() and see if the problem goes away. Note that gc.disable() only disables the mark-and-sweep part of garbage collection, which is the part that can cause the Python interpreter to freeze. It doesn't disable garbage collection when the refcount of an object reaches zero. Bye, Maarten

On Wednesday, 15 May 2019 13:11:29 CEST Scott, Barry wrote:
In my experience gc is low level noise only. Our work load is very heavy and its never been an issue.
It depends entirely on how many objects you have. With a few GBs worth of relatively small objects, a full garbage collection cycle can take over a minute. But in normal circumstances, the overhead indeed isn't relevant. Bye, Maarten

Aloha the code below has proven to be a life saver together with manhole. It might be a bit outdated but I usually create a manhole access allowing me to inspect the server from within then using dumpObjects to look at the sea of objects in the running process. Even only watching the inner workings of a server is quite fun and illuminating. Werner exc =[ "function", "type", "list", "dict", "tuple", "wrapper_descriptor", "module", "method_descriptor", "member_descriptor", "instancemethod", "builtin_function_or_method", "frame", "classmethod", "classmethod_descriptor", "_Environ", "MemoryError", "_Printer", "_Helper", "getset_descriptor", "weakref" ] inc =[ ] prev ={} #delta - print delta between current and previous run #limit - only print object count bigger than limit #inc - include named objects in the dump #exc - exclude named obects in the dump defdumpObjects(delta=True, limit=0, include=inc, exclude=[]): globalprev ifinclude !=[] andexclude !=[]: print'cannot use include and exclude at the same time' return print'working with:' print' delta: ', delta print' limit: ', limit print' include: ', include print' exclude: ', exclude objects ={} gc.collect() oo =gc.get_objects() foro inoo: ifgetattr(o, "__class__", None): name =o.__class__.__name__ if((exclude ==[] andinclude ==[]) or\ (exclude !=[] andname notinexclude) or\ (include !=[] andname ininclude)): objects[name] =objects.get(name, 0) +1 ## if more: ## print o pk =prev.keys() pk.sort() names =objects.keys() names.sort() forname innames: iflimit ==0orobjects[name] >limit: ifnotprev.has_key(name): prev[name] =objects[name] dt =objects[name] -prev[name] ifdelta ordt !=0: print'%0.6d-- %0.6d-- '%(dt, objects[name]), name prev[name] =objects[name] defgetObjects(oname): """ gets an object list with all the named objects out of the sea of gc'ed objects """ olist =[] objects ={} gc.collect() oo =gc.get_objects() foro inoo: ifgetattr(o, "__class__", None): name =o.__class__.__name__ if(name ==oname): olist.append(o) returnolist On 5/15/19 02:19, Maarten ter Huurne wrote:

On 14/05/2019 16:39, Scott, Barry wrote:
Yes ;-)
Good call on strace, will do so next time one of these happens.
As its hung for seconds you could connect gdb to the process and see a python backtrace using the py-bt command. You are likely to catch the offending code.
I'm failing miserably to get gdb to find debug symbols for python. I've installed the system python's debug symbols rpm, I've installed gdb .py that ships in the python source tree (so at least I *have* a py-bt command now!) - what am I likely to be missing?
You could adding logging to your code to show the time of callback entry and exit.
Right, but if the callback yields, that will be falsely shown as blocking, no? cheers, Chris

On Wednesday, 15 May 2019 06:27:30 BST Chris Withers wrote:
At least on Fedora/Centos when you run up gdb you get a message that tells you the the command to run to install missing symbols. This is what I see on Fedora 30 for example: $ gdb python3 GNU gdb (GDB) Fedora 8.2.91.20190424-24.fc30 ... skip ... For help, type "help". Type "apropos word" to search for commands related to "word"... Reading symbols from python3... (No debugging symbols found in python3) Missing separate debuginfos, use: dnf debuginfo-install python3-3.7.3-1.fc30.x86_64 (gdb) Then I quit gdb and run the sudo dnf as suggested. Then next time you run gdb it will suggest even mode debug info to install. You can often skip that, you have all you need to see python code.
I'm not used to using yield with twisted. But it will not block.
cheers,
Chris
participants (5)
-
Chris Withers
-
Glyph
-
Maarten ter Huurne
-
Scott, Barry
-
Werner Thie