[IPython-dev] debugging remote engine death
Moritz Emanuel Beber
moritz.beber at gmail.com
Sat Oct 18 07:50:50 EDT 2014
Hi all,
I'd like to ask for your help in debugging a remote error. In case it
matters, I'm running this on:
Python 2.7.5
ipython 2.3.0
pyzmq 14.3.1
I'm not sure about the libzmq version, but it was pulled in with pyzmq
and should be recent.
So I'm using IPython.parallel with default profile and settings. The
error I'm getting is the following:
Traceback (most recent call last):
  File "scripts/trn_randomization_analysis.py", line 314, in <module>
    sys.exit(args.func(remote_client, args))
  File "scripts/trn_randomization_analysis.py", line 259, in main_analysis
    for df in res_it:
  File "/home/mbeber/.virtualenvs/control/local/lib/python2.7/site-packages/IPython/parallel/client/asyncresult.py", line 594, in __iter__
    for r in it():
  File "/home/mbeber/.virtualenvs/control/local/lib/python2.7/site-packages/IPython/parallel/client/asyncresult.py", line 641, in _unordered_iter
    rlist = ar.get()
  File "/home/mbeber/.virtualenvs/control/local/lib/python2.7/site-packages/IPython/parallel/client/asyncresult.py", line 118, in get
    raise self._exception
IPython.parallel.error.RemoteError: EngineError(Engine 'bd38ee8e-ad65-41af-944d-a9ea15162c03' died while running task u'a392e922-e01f-41a7-9e5e-def08ba61da8')
So my first question/concern here is: A single engine has died.
Shouldn't my main process just keep running and reschedule the task to a
different engine?
I suspect that it may be a memory problem, so I wanted to inspect the
logs. However, the log file names do not contain the engine UUID; they are
ipcontroller-<number>.log or ipengine-<number>.log. I guess the numbers
correspond to the PIDs? Either way, I'm not sure how to find that number
after the engine has died and the main process has terminated. The
cluster is still running, so maybe that is of help?
So instead I did a grep on all logs for the engine UUID:
cat * | grep bd38ee8e-ad65-41af-944d-a9ea15162c03
2014-10-12 20:18:26.477 [IPControllerApp] client::client 'bd38ee8e-ad65-41af-944d-a9ea15162c03' requested u'registration_request'
2014-10-12 20:18:26.529 [IPControllerApp] WARNING | iopub::IOPub message lacks parent: {'parent_header': {}, 'msg_type': u'status', 'msg_id': u'042f9a0a-61c9-4238-aee5-3a6ea89596e0', 'content': {u'execution_state': u'starting'}, 'header': {u'date': datetime.datetime(2014, 10, 12, 20, 18, 26, 529317), u'username': u'mbeber', u'session': u'bd38ee8e-ad65-41af-944d-a9ea15162c03', u'msg_id': u'042f9a0a-61c9-4238-aee5-3a6ea89596e0', u'msg_type': u'status'}, 'buffers': [], 'metadata': {}}
2014-10-12 20:18:31.693 [IPControllerApp] registration::finished registering engine 0:bd38ee8e-ad65-41af-944d-a9ea15162c03
2014-10-12 22:28:04.693 [IPControllerApp] heartbeat::missed bd38ee8e-ad65-41af-944d-a9ea15162c03 : 1
2014-10-12 22:28:07.693 [IPControllerApp] heartbeat::missed bd38ee8e-ad65-41af-944d-a9ea15162c03 : 2
2014-10-12 22:28:10.692 [IPControllerApp] heartbeat::missed bd38ee8e-ad65-41af-944d-a9ea15162c03 : 3
2014-10-12 22:28:13.693 [IPControllerApp] heartbeat::missed bd38ee8e-ad65-41af-944d-a9ea15162c03 : 4
2014-10-12 22:28:16.692 [IPControllerApp] heartbeat::missed bd38ee8e-ad65-41af-944d-a9ea15162c03 : 5
2014-10-12 22:28:19.692 [IPControllerApp] heartbeat::missed bd38ee8e-ad65-41af-944d-a9ea15162c03 : 6
2014-10-12 22:28:22.693 [IPControllerApp] heartbeat::missed bd38ee8e-ad65-41af-944d-a9ea15162c03 : 7
2014-10-12 22:28:25.692 [IPControllerApp] heartbeat::missed bd38ee8e-ad65-41af-944d-a9ea15162c03 : 8
2014-10-12 22:28:28.693 [IPControllerApp] heartbeat::missed bd38ee8e-ad65-41af-944d-a9ea15162c03 : 9
2014-10-12 22:28:31.692 [IPControllerApp] heartbeat::missed bd38ee8e-ad65-41af-944d-a9ea15162c03 : 10
2014-10-12 22:28:34.692 [IPControllerApp] heartbeat::missed bd38ee8e-ad65-41af-944d-a9ea15162c03 : 11
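Since `cat * | grep` drops the file names, the same search can be sketched in Python so that each match stays attached to the log file it came from. This is only a minimal sketch: the function name find_uuid_in_logs is made up for illustration, and the log directory in the usage comment (the default profile's log folder) is an assumption about where the files live.

```python
import io
import os


def find_uuid_in_logs(log_dir, uuid):
    """Return (file name, line) pairs for every .log line that mentions uuid."""
    matches = []
    for name in sorted(os.listdir(log_dir)):
        # Only look at files named like ipcontroller-<number>.log
        # or ipengine-<number>.log; skip everything else.
        if not name.endswith(".log"):
            continue
        path = os.path.join(log_dir, name)
        # io.open behaves the same on Python 2.7 and 3; undecodable
        # bytes are replaced instead of aborting the scan.
        with io.open(path, encoding="utf-8", errors="replace") as handle:
            for line in handle:
                if uuid in line:
                    matches.append((name, line.rstrip("\n")))
    return matches


# Hypothetical usage; the directory below is an assumption:
# log_dir = os.path.expanduser("~/.ipython/profile_default/log")
# for name, line in find_uuid_in_logs(log_dir, "bd38ee8e-ad65-41af-944d-a9ea15162c03"):
#     print("%s: %s" % (name, line))
```

Keeping the file name next to each hit shows whether a match comes from an ipcontroller or an ipengine log, which the plain grep output above does not reveal.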
So I found nothing that tells me much more. Can anyone answer some of my
questions and/or recommend a strategy for further investigating this problem?
Thank you in advance,
Moritz