[Twisted-Python] finding out what's blocking the reactor

Hi All,
On some busy instances of the app I'm working with/on, we see websocket connections being dropped due to not ping/ponging in time, I'm fairly sure this is as a result of some user-written callbacks blocking the reactor for long periods of time (seconds...).
What's the best way to debug a twisted application to find out what's blocking the reactor?
cheers,
Chris

On May 13, 2019, at 7:26 AM, Chris Withers <chris@withers.org> wrote:
Hi All,
On some busy instances of the app I'm working with/on, we see websocket connections being dropped due to not ping/ponging in time, I'm fairly sure this is as a result of some user-written callbacks blocking the reactor for long periods of time (seconds...).
What's the best way to debug a twisted application to find out what's blocking the reactor?
I don’t know about “the best”, but I’ve had some success with https://vmprof.readthedocs.io/en/latest and https://github.com/ggreer/twisted_hang depending on what you’re looking for. Let us know if these work out! -g
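For anyone wanting to try vmprof on a Twisted process, a minimal sketch might look like the following; `build_app()` is a hypothetical stand-in for whatever sets up your listeners, and the exact enable() signature should be double-checked against the vmprof docs for your version. The resulting file can then be inspected with the vmprof viewer tools.

import vmprof
from twisted.internet import reactor

def main():
    build_app()                     # hypothetical setup of your listeners/factories
    with open("twisted.prof", "w+b") as fd:
        vmprof.enable(fd.fileno())  # sample the whole process, reactor loop included
        try:
            reactor.run()
        finally:
            vmprof.disable()        # flush samples before the file closes

if __name__ == "__main__":
    main()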

On Monday, 13 May 2019 15:26:01 BST Chris Withers wrote:
Hi All,
On some busy instances of the app I'm working with/on, we see websocket connections being dropped due to not ping/ponging in time, I'm fairly sure this is as a result of some user-written callbacks blocking the reactor for long periods of time (seconds...).
What's the best way to debug a twisted application to find out what's blocking the reactor?
Is the process CPU bound or IO bound?

If CPU bound then cProfile or the tool Glyph mentioned would help.

This advice is from a Linux perspective.

If it's IO bound then you may be waiting on a blocking operation. You could use strace on the process to see what it's doing.

As it's hung for seconds, you could connect gdb to the process and see a Python backtrace using the py-bt command. You are likely to catch the offending code. The normal bt command will show you what the last Python command is calling.

You could add logging to your code to show the time of callback entry and exit.

If all else fails, patch the Twisted code to time callbacks and report the slow ones.

Barry
cheers,
Chris
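As a concrete (if unofficial) version of Barry's "log callback entry and exit" suggestion, a decorator along these lines can be dropped onto suspect callbacks; the threshold and names are illustrative, not anything Twisted ships.

import functools
import time
from twisted.logger import Logger

log = Logger()

def warn_if_slow(threshold=0.1):
    """Log a warning whenever the decorated callback runs longer than threshold seconds."""
    def decorate(f):
        @functools.wraps(f)
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            try:
                return f(*args, **kwargs)
            finally:
                elapsed = time.monotonic() - start
                if elapsed > threshold:
                    log.warn("{name} blocked the reactor for {secs}s",
                             name=f.__qualname__, secs="%.3f" % elapsed)
        return wrapper
    return decorate

Note that wrapping an inlineCallbacks-decorated function this way also counts time spent waiting on Deferreds, which is the caveat Chris raises further down the thread.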

Hi all,

With periods of seemingly no activity, the gc comes to mind. Is the working set size stable or ever increasing? Swap operations might block activity of the process. I had problems in the beginning with twisted, because I was holding on to a lot of objects and thus creating circular, non-collectible object references, leading to ever increasing working sets over time, with the gc hitting me hard once in a while.

The first check I always do is to keep an eye on the size of the process at constant network load: the size should shrink when network load is dwindling and increase when network load is growing. If it is possible to manipulate network load precisely, then returning to the same load should repeatedly give you the same process size after some time. As a reference, I have twisted processes running with up times measured in years on FreeBSD.

Werner

On 5/14/19 05:39, Scott, Barry wrote:
On Monday, 13 May 2019 15:26:01 BST Chris Withers wrote:
Hi All,
On some busy instances of the app I'm working with/on, we see websocket connections being dropped due to not ping/ponging in time, I'm fairly sure this is as a result of some user-written callbacks blocking the reactor for long periods of time (seconds...).
What's the best way to debug a twisted application to find out what's blocking the reactor? Is the process CPU bound or IO bound?
If CPU bound then cprofile or the tool Glyph mentioned would help.
This advice is from a Linux perspective.
If it's IO bound then you may be waiting on a blocking operation. You could use strace on the process to see what it's doing.
As it's hung for seconds you could connect gdb to the process and see a python backtrace using the py-bt command. You are likely to catch the offending code.
The normal bt command will show you what the last python command is calling.
You could add logging to your code to show the time of callback entry and exit.
If all else fails patch the twisted code to time callbacks and report the slow ones.
Barry
cheers,
Chris
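A rough, Linux-only sketch (not from the thread) of the process-size check Werner describes; he is on FreeBSD, where the numbers would come from elsewhere. It logs resident set size from inside the reactor so it can be lined up against network load. The interval and logging are illustrative.

import os
from twisted.internet import task
from twisted.logger import Logger

log = Logger()
_PAGE = os.sysconf("SC_PAGE_SIZE")

def log_rss():
    # /proc/self/statm: second field is the resident set size in pages
    with open("/proc/self/statm") as f:
        resident_pages = int(f.read().split()[1])
    log.info("resident set size: {mb} MiB",
             mb="%.1f" % (resident_pages * _PAGE / 2**20))

def start_memory_monitor(interval=60.0):
    # Call once after the reactor is set up; logs immediately, then every interval.
    task.LoopingCall(log_rss).start(interval, now=True)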

On Wednesday, 15 May 2019 07:30:54 CEST Chris Withers wrote:
On 14/05/2019 22:14, Werner Thie wrote:
Hi all
with periods of seemingly no activity the gc comes to mind.
Hmm, gc blocking the process is an interesting thought, what evidence would you see with gdb, etc, to show that this was the problem?
A simple way is to call gc.disable() and see if the problem goes away. Note that gc.disable() only disables the mark-and-sweep part of garbage collection, which is the part that can cause the Python interpreter to freeze. It doesn't disable garbage collection when the refcount of an object reaches zero. Bye, Maarten
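Before (or instead of) disabling collection outright, one way to get the evidence Chris asks about is to time the collections themselves; Python 3.3+ exposes gc.callbacks for this. A rough sketch of my own, not from the thread; the 50 ms threshold is arbitrary. If these pauses line up with the dropped websocket pings, gc is a suspect.

import gc
import time

_gc_start = None

def _gc_timer(phase, info):
    # Called by the interpreter with phase "start" or "stop" around each collection.
    global _gc_start
    if phase == "start":
        _gc_start = time.monotonic()
    elif phase == "stop" and _gc_start is not None:
        elapsed = time.monotonic() - _gc_start
        if elapsed > 0.05:  # only report pauses worth noticing
            print("gc generation %d paused the process for %.3fs"
                  % (info["generation"], elapsed))

gc.callbacks.append(_gc_timer)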

On Wednesday, 15 May 2019 06:30:54 BST Chris Withers wrote:
On 14/05/2019 22:14, Werner Thie wrote:
Hi all
with periods of seemingly no activity the gc comes to mind.
Hmm, gc blocking the process is an interesting thought, what evidence would you see with gdb, etc, to show that this was the problem?
In my experience gc is low-level noise only. Our work load is very heavy and it's never been an issue.
Barry
Chris

On Wednesday, 15 May 2019 13:11:29 CEST Scott, Barry wrote:
In my experience gc is low level noise only. Our work load is very heavy and its never been an issue.
It depends entirely on how many objects you have. With a few GBs worth of relatively small objects, a full garbage collection cycle can take over a minute. But in normal circumstances, the overhead indeed isn't relevant. Bye, Maarten
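As a rough way to see Maarten's point on your own hardware (the numbers depend entirely on the machine and the object mix), you can time a full collection over a few million tracked objects:

import gc
import time

data = [[i] for i in range(5_000_000)]  # millions of small gc-tracked containers
start = time.monotonic()
gc.collect()
print("full collection took %.2fs with %d container objects alive"
      % (time.monotonic() - start, len(data)))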

Aloha,

The code below has proven to be a life saver together with manhole. It might be a bit outdated, but I usually create a manhole access allowing me to inspect the server from within, then use dumpObjects to look at the sea of objects in the running process. Even only watching the inner workings of a server is quite fun and illuminating.

Werner

# Note: written for Python 2 (print statements, dict.has_key).
import gc

exc = [
    "function", "type", "list", "dict", "tuple", "wrapper_descriptor",
    "module", "method_descriptor", "member_descriptor", "instancemethod",
    "builtin_function_or_method", "frame", "classmethod",
    "classmethod_descriptor", "_Environ", "MemoryError", "_Printer",
    "_Helper", "getset_descriptor", "weakref"
]
inc = []
prev = {}

# delta - print delta between current and previous run
# limit - only print object count bigger than limit
# inc   - include named objects in the dump
# exc   - exclude named objects in the dump
def dumpObjects(delta=True, limit=0, include=inc, exclude=[]):
    global prev
    if include != [] and exclude != []:
        print 'cannot use include and exclude at the same time'
        return
    print 'working with:'
    print '  delta: ', delta
    print '  limit: ', limit
    print '  include: ', include
    print '  exclude: ', exclude
    objects = {}
    gc.collect()
    oo = gc.get_objects()
    for o in oo:
        if getattr(o, "__class__", None):
            name = o.__class__.__name__
            if ((exclude == [] and include == []) or
                    (exclude != [] and name not in exclude) or
                    (include != [] and name in include)):
                objects[name] = objects.get(name, 0) + 1
    pk = prev.keys()
    pk.sort()
    names = objects.keys()
    names.sort()
    for name in names:
        if limit == 0 or objects[name] > limit:
            if not prev.has_key(name):
                prev[name] = objects[name]
            dt = objects[name] - prev[name]
            if delta or dt != 0:
                print '%0.6d -- %0.6d -- ' % (dt, objects[name]), name
            prev[name] = objects[name]

def getObjects(oname):
    """
    gets an object list with all the named objects out of the sea of
    gc'ed objects
    """
    olist = []
    gc.collect()
    oo = gc.get_objects()
    for o in oo:
        if getattr(o, "__class__", None):
            name = o.__class__.__name__
            if name == oname:
                olist.append(o)
    return olist

On 5/15/19 02:19, Maarten ter Huurne wrote:
On Wednesday, 15 May 2019 13:11:29 CEST Scott, Barry wrote:
In my experience gc is low-level noise only. Our work load is very heavy and it's never been an issue.
It depends entirely on how many objects you have. With a few GBs worth of relatively small objects, a full garbage collection cycle can take over a minute. But in normal circumstances, the overhead indeed isn't relevant.
Bye, Maarten

On 14/05/2019 16:39, Scott, Barry wrote:
On Monday, 13 May 2019 15:26:01 BST Chris Withers wrote:
Hi All,
On some busy instances of the app I'm working with/on, we see websocket connections being dropped due to not ping/ponging in time, I'm fairly sure this is as a result of some user-written callbacks blocking the reactor for long periods of time (seconds...).
What's the best way to debug a twisted application to find out what's blocking the reactor?
Is the process CPU bound or IO bound?
Yes ;-)
If CPU bound then cprofile or the tool Glyph mentioned would help.
This advice is from a Linux perspective.
If it's IO bound then you may be waiting on a blocking operation. You could use strace on the process to see what it's doing.
Good call on strace, will do so next time one of these happens.
As it's hung for seconds you could connect gdb to the process and see a python backtrace using the py-bt command. You are likely to catch the offending code.
I'm failing miserably to get gdb to find debug symbols for python. I've installed the system python's debug symbols rpm, and I've installed the gdb .py helper that ships in the python source tree (so at least I *have* a py-bt command now!) - what am I likely to be missing?
You could add logging to your code to show the time of callback entry and exit.
Right, but if the callback yields, that will be falsely shown as blocking, no?
cheers,
Chris

On Wednesday, 15 May 2019 06:27:30 BST Chris Withers wrote:
On 14/05/2019 16:39, Scott, Barry wrote:
On Monday, 13 May 2019 15:26:01 BST Chris Withers wrote:
Hi All,
On some busy instances of the app I'm working with/on, we see websocket connections being dropped due to not ping/ponging in time, I'm fairly sure this is as a result of some user-written callbacks blocking the reactor for long periods of time (seconds...).
What's the best way to debug a twisted application to find out what's blocking the reactor?
Is the process CPU bound or IO bound?
Yes ;-)
If CPU bound then cprofile or the tool Glyph mentioned would help.
This advice is from a Linux perspective.
If it's IO bound then you may be waiting on a blocking operation. You could use strace on the process to see what it's doing.
Good call on strace, will do so next time one of these happens.
As it's hung for seconds you could connect gdb to the process and see a python backtrace using the py-bt command. You are likely to catch the offending code.
I'm failing miserably to get gdb to find debug symbols for python. I've installed the system python's debug symbols rpm, and I've installed the gdb .py helper that ships in the python source tree (so at least I *have* a py-bt command now!) - what am I likely to be missing?
At least on Fedora/CentOS, when you run up gdb you get a message that tells you the command to run to install missing symbols. This is what I see on Fedora 30, for example:

$ gdb python3
GNU gdb (GDB) Fedora 8.2.91.20190424-24.fc30
... skip ...
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from python3...
(No debugging symbols found in python3)
Missing separate debuginfos, use: dnf debuginfo-install python3-3.7.3-1.fc30.x86_64
(gdb)

Then I quit gdb and run the sudo dnf as suggested. The next time you run gdb it will suggest even more debug info to install. You can often skip that; you have all you need to see python code.
You could add logging to your code to show the time of callback entry and exit.
Right, but if the callback yields, that will be falsely shown as blocking, no?
I'm not used to using yield with twisted. But it will not block.
cheers,
Chris
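To illustrate why naive entry/exit timing can mislead here, as Chris suspects: in the sketch below (mine, not from the thread) the handler's wall-clock time is about two seconds, yet the reactor is free for the whole time it is parked on the yield.

import time
from twisted.internet import defer, reactor, task

@defer.inlineCallbacks
def handler():
    start = time.monotonic()
    # The reactor keeps servicing other events while we wait on this Deferred.
    yield task.deferLater(reactor, 2.0, lambda: None)
    print("entry to exit took %.1fs, but the reactor was never blocked"
          % (time.monotonic() - start))

reactor.callWhenRunning(handler)
reactor.callLater(3.0, reactor.stop)
reactor.run()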

On May 14, 2019, at 8:39 AM, Scott, Barry <barry.scott@forcepoint.com> wrote:
On Monday, 13 May 2019 15:26:01 BST Chris Withers wrote:
Hi All,
On some busy instances of the app I'm working with/on, we see websocket connections being dropped due to not ping/ponging in time, I'm fairly sure this is as a result of some user-written callbacks blocking the reactor for long periods of time (seconds...).
What's the best way to debug a twisted application to find out what's blocking the reactor?
Is the process CPU bound or IO bound?
If CPU bound then cprofile or the tool Glyph mentioned would help.
The tool I mentioned (vmprof) is not a CPU profiler, and will help with I/O bound tasks as well. cProfile can also tell you about wall clock time, but has other limitations that might not make it suitable for this. -g
This advice is from a Linux perspective.
If it's IO bound then you may be waiting on a blocking operation. You could use strace on the process to see what it's doing.
As it's hung for seconds you could connect gdb to the process and see a python backtrace using the py-bt command. You are likely to catch the offending code.
The normal bt command will show you what the last python command is calling.
You could add logging to your code to show the time of callback entry and exit.
If all else fails patch the twisted code to time callbacks and report the slow ones.
Barry
cheers,
Chris
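Picking up Glyph's point about cProfile above: rather than profiling the whole process, one low-tech option is to profile a single suspect callback in place. A minimal sketch, with `on_message` as a purely hypothetical stand-in for the user-written callback:

import cProfile
import functools
import pstats

def profiled(f):
    """Run f under cProfile and print the hottest entries when it returns."""
    @functools.wraps(f)
    def wrapper(*args, **kwargs):
        profiler = cProfile.Profile()
        try:
            return profiler.runcall(f, *args, **kwargs)
        finally:
            pstats.Stats(profiler).sort_stats("cumulative").print_stats(15)
    return wrapper

# Usage (illustrative): decorate the suspect user callback.
# @profiled
# def on_message(payload):
#     ...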
participants (5)
- Chris Withers
- Glyph
- Maarten ter Huurne
- Scott, Barry
- Werner Thie