[IPython-dev] debugging remote engine death

Moritz Emanuel Beber moritz.beber at gmail.com
Sat Oct 18 07:50:50 EDT 2014


Hi all,

I'd like to ask for your help in debugging a remote error. In case it 
matters, I'm running this on:

Python 2.7.5
ipython 2.3.0
pyzmq 14.3.1

I'm not sure about the libzmq version, but it was pulled in with pyzmq and 
should be recent.

I'm using IPython.parallel with the default profile and settings. The 
error I'm getting is the following:

Traceback (most recent call last):
  File "scripts/trn_randomization_analysis.py", line 314, in <module>
    sys.exit(args.func(remote_client, args))
  File "scripts/trn_randomization_analysis.py", line 259, in main_analysis
    for df in res_it:
  File "/home/mbeber/.virtualenvs/control/local/lib/python2.7/site-packages/IPython/parallel/client/asyncresult.py", line 594, in __iter__
    for r in it():
  File "/home/mbeber/.virtualenvs/control/local/lib/python2.7/site-packages/IPython/parallel/client/asyncresult.py", line 641, in _unordered_iter
    rlist = ar.get()
  File "/home/mbeber/.virtualenvs/control/local/lib/python2.7/site-packages/IPython/parallel/client/asyncresult.py", line 118, in get
    raise self._exception
IPython.parallel.error.RemoteError: EngineError(Engine 'bd38ee8e-ad65-41af-944d-a9ea15162c03' died while running task u'a392e922-e01f-41a7-9e5e-def08ba61da8')

So my first question/concern here is: A single engine has died. 
Shouldn't my main process just keep running and reschedule the task to a 
different engine?
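
In the meantime, the kind of client-side workaround I have in mind is 
sketched below. It is only a sketch: run_with_resubmit and 
looks_like_engine_death are names I made up, tasks must be hashable, and 
I am assuming that the RemoteError raised on the client carries an 
`ename` attribute equal to 'EngineError' when an engine dies. The 
`submit` argument is any callable that returns an object with a blocking 
get() method, such as the AsyncResult returned by a view's apply_async.

```python
def looks_like_engine_death(exc):
    """Guess whether an exception reports a dead engine.

    Assumption: the client-side RemoteError exposes the remote
    exception's class name as `ename`.
    """
    return getattr(exc, "ename", None) == "EngineError"


def run_with_resubmit(submit, tasks, should_retry, max_attempts=3):
    """Submit all tasks, then resubmit the ones that failed retryably.

    submit(task) must return a handle with a blocking get() method.
    Any other failure, or a task that runs out of attempts, re-raises.
    """
    pending = dict((task, submit(task)) for task in tasks)
    attempts = dict((task, 1) for task in pending)
    results = {}
    while pending:
        retry = {}
        for task, handle in pending.items():
            try:
                results[task] = handle.get()
            except Exception as exc:
                if not should_retry(exc) or attempts[task] >= max_attempts:
                    raise
                attempts[task] += 1
                # Hand the task out again; the scheduler picks the engine.
                retry[task] = submit(task)
        pending = retry
    return results
```

With a load-balanced view `lview` and a task function `analyse` (both 
placeholders here), `submit` would be something like 
`lambda task: lview.apply_async(analyse, task)`.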

I suspect that it may be a memory problem, so I wanted to inspect the 
logs. However, the log file names do not contain the engine UUID; they are 
ipcontroller-<number>.log or ipengine-<number>.log. I guess the numbers 
correspond to the PIDs? Either way, I'm not sure how to find that number 
after the engine has died and the main process has terminated. The 
cluster is still running, in case that helps.

So instead I did a grep on all logs for the engine UUID:

cat * | grep bd38ee8e-ad65-41af-944d-a9ea15162c03

2014-10-12 20:18:26.477 [IPControllerApp] client::client 'bd38ee8e-ad65-41af-944d-a9ea15162c03' requested u'registration_request'
2014-10-12 20:18:26.529 [IPControllerApp] WARNING | iopub::IOPub message lacks parent: {'parent_header': {}, 'msg_type': u'status', 'msg_id': u'042f9a0a-61c9-4238-aee5-3a6ea89596e0', 'content': {u'execution_state': u'starting'}, 'header': {u'date': datetime.datetime(2014, 10, 12, 20, 18, 26, 529317), u'username': u'mbeber', u'session': u'bd38ee8e-ad65-41af-944d-a9ea15162c03', u'msg_id': u'042f9a0a-61c9-4238-aee5-3a6ea89596e0', u'msg_type': u'status'}, 'buffers': [], 'metadata': {}}
2014-10-12 20:18:31.693 [IPControllerApp] registration::finished registering engine 0:bd38ee8e-ad65-41af-944d-a9ea15162c03
2014-10-12 22:28:04.693 [IPControllerApp] heartbeat::missed bd38ee8e-ad65-41af-944d-a9ea15162c03 : 1
2014-10-12 22:28:07.693 [IPControllerApp] heartbeat::missed bd38ee8e-ad65-41af-944d-a9ea15162c03 : 2
2014-10-12 22:28:10.692 [IPControllerApp] heartbeat::missed bd38ee8e-ad65-41af-944d-a9ea15162c03 : 3
2014-10-12 22:28:13.693 [IPControllerApp] heartbeat::missed bd38ee8e-ad65-41af-944d-a9ea15162c03 : 4
2014-10-12 22:28:16.692 [IPControllerApp] heartbeat::missed bd38ee8e-ad65-41af-944d-a9ea15162c03 : 5
2014-10-12 22:28:19.692 [IPControllerApp] heartbeat::missed bd38ee8e-ad65-41af-944d-a9ea15162c03 : 6
2014-10-12 22:28:22.693 [IPControllerApp] heartbeat::missed bd38ee8e-ad65-41af-944d-a9ea15162c03 : 7
2014-10-12 22:28:25.692 [IPControllerApp] heartbeat::missed bd38ee8e-ad65-41af-944d-a9ea15162c03 : 8
2014-10-12 22:28:28.693 [IPControllerApp] heartbeat::missed bd38ee8e-ad65-41af-944d-a9ea15162c03 : 9
2014-10-12 22:28:31.692 [IPControllerApp] heartbeat::missed bd38ee8e-ad65-41af-944d-a9ea15162c03 : 10
2014-10-12 22:28:34.692 [IPControllerApp] heartbeat::missed bd38ee8e-ad65-41af-944d-a9ea15162c03 : 11
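
Piping through cat throws away the file names, which is exactly the 
piece I was missing. A rough Python equivalent that keeps them could look 
like the sketch below. find_in_logs is a name I made up, and the "*.log" 
pattern is an assumption based on the file names mentioned above.

```python
import glob
import os


def find_in_logs(log_dir, needle, pattern="*.log"):
    """Yield (file name, line number, line) for each line containing needle."""
    for path in sorted(glob.glob(os.path.join(log_dir, pattern))):
        with open(path) as handle:
            for number, line in enumerate(handle, 1):
                if needle in line:
                    yield os.path.basename(path), number, line.rstrip("\n")


def files_mentioning(log_dir, needle, pattern="*.log"):
    """Return the sorted names of the log files that mention needle."""
    return sorted(set(name for name, _, _ in
                      find_in_logs(log_dir, needle, pattern)))
```

Calling files_mentioning(log_dir, 'bd38ee8e-ad65-41af-944d-a9ea15162c03') 
on the profile's log directory should then tell me which controller or 
engine log to read in full. Running grep on the files directly, instead 
of on the output of cat, also prints the file name in front of each match.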


So I found nothing that tells me much more. Can anyone answer some of my 
questions and/or recommend a strategy for further investigating this problem?
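
One thing I can try on my side, to test the memory suspicion, is to have 
each task report the peak memory of the engine process along with its 
result. A minimal sketch, assuming the engines run on a Unix system (the 
resource module is not available on Windows); call_and_report is a name 
I made up:

```python
def call_and_report(func, *args, **kwargs):
    """Call func and return (result, peak RSS of this process in MiB)."""
    # Imports live inside the function so that it does not depend on
    # module-level names when it is shipped to another process.
    import resource
    import sys

    result = func(*args, **kwargs)
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux but in bytes on OS X.
    divisor = 1024.0 ** 2 if sys.platform == "darwin" else 1024.0
    return result, peak / divisor
```

This cannot report anything for the task that actually kills the engine, 
but if the peaks of the completed tasks creep towards the machine's 
memory limit, that would support the theory.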

Thank you in advance,
Moritz



