[IPython-dev] debugging remote engine death

Moritz Beber moritz.beber at gmail.com
Wed Oct 22 18:02:39 EDT 2014


Can anyone at least tell me how to find the right engine log to look into,
given the UUID of the engine, please?
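
To make the question concrete, here is a minimal sketch of the lookup I am after. It assumes the controller log contains lines of the form "registration::finished registering engine 0:<uuid>", as in the grep output quoted below; the helper names are made up for illustration. Whether the engine logs themselves mention the integer id is an assumption I have not verified.

```python
import re

# Pattern modelled on the controller log line quoted in this thread:
#   "... registration::finished registering engine 0:<uuid>"
REGISTERED = re.compile(r"registering engine (\d+):([0-9a-f-]{36})")


def engine_ids_by_uuid(lines):
    """Map each engine UUID to the integer id the controller assigned to it."""
    mapping = {}
    for line in lines:
        match = REGISTERED.search(line)
        if match:
            mapping[match.group(2)] = int(match.group(1))
    return mapping


def files_mentioning(text, paths):
    """Return the paths whose contents mention text (similar to grep -l)."""
    hits = []
    for path in paths:
        with open(path) as handle:
            if any(text in line for line in handle):
                hits.append(path)
    return hits
```

The paths could come from something like glob.glob(os.path.expanduser("~/.ipython/profile_default/log/*.log")), assuming the default profile keeps its logs in that directory. Unlike "cat * | grep", files_mentioning reports which file each match came from.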

On Sat, Oct 18, 2014 at 1:50 PM, Moritz Emanuel Beber <
moritz.beber at gmail.com> wrote:

> Hi all,
>
> I'd like to ask for your help in debugging a remote error. In case it
> matters, I'm running this on:
>
> Python 2.7.5
>
> ipython 2.3.0
>
> pyzmq 14.3.1
>
>
> I'm not sure about the libzmq version, but it was pulled in with pyzmq and
> should be recent.
>
> I'm using IPython.parallel with the default profile and settings. The error
> I'm getting is the following:
>
> Traceback (most recent call last):
>   File "scripts/trn_randomization_analysis.py", line 314, in <module>
>     sys.exit(args.func(remote_client, args))
>   File "scripts/trn_randomization_analysis.py", line 259, in main_analysis
>     for df in res_it:
>   File "/home/mbeber/.virtualenvs/control/local/lib/python2.7/site-packages/IPython/parallel/client/asyncresult.py", line 594, in __iter__
>     for r in it():
>   File "/home/mbeber/.virtualenvs/control/local/lib/python2.7/site-packages/IPython/parallel/client/asyncresult.py", line 641, in _unordered_iter
>     rlist = ar.get()
>   File "/home/mbeber/.virtualenvs/control/local/lib/python2.7/site-packages/IPython/parallel/client/asyncresult.py", line 118, in get
>     raise self._exception
> IPython.parallel.error.RemoteError: EngineError(Engine 'bd38ee8e-ad65-41af-944d-a9ea15162c03' died while running task u'a392e922-e01f-41a7-9e5e-def08ba61da8')
>
> So my first question/concern here is: A single engine has died. Shouldn't
> my main process just keep running and reschedule the task to a different
> engine?
>
> I suspect that it may be a memory problem, so I wanted to inspect the
> logs. However, the log file names do not contain the engine UUID. They are
> named ipcontroller-<number>.log or ipengine-<number>.log. I guess the
> numbers correspond to the PIDs? Either way, I'm not sure how to find that
> number now that the engine has died and the main process has terminated.
> The cluster is still running, so maybe that is of help?
>
> So instead I did a grep on all logs for the engine UUID:
>
> cat * | grep bd38ee8e-ad65-41af-944d-a9ea15162c03
>
> 2014-10-12 20:18:26.477 [IPControllerApp] client::client 'bd38ee8e-ad65-41af-944d-a9ea15162c03' requested u'registration_request'
> 2014-10-12 20:18:26.529 [IPControllerApp] WARNING | iopub::IOPub message lacks parent: {'parent_header': {}, 'msg_type': u'status', 'msg_id': u'042f9a0a-61c9-4238-aee5-3a6ea89596e0', 'content': {u'execution_state': u'starting'}, 'header': {u'date': datetime.datetime(2014, 10, 12, 20, 18, 26, 529317), u'username': u'mbeber', u'session': u'bd38ee8e-ad65-41af-944d-a9ea15162c03', u'msg_id': u'042f9a0a-61c9-4238-aee5-3a6ea89596e0', u'msg_type': u'status'}, 'buffers': [], 'metadata': {}}
> 2014-10-12 20:18:31.693 [IPControllerApp] registration::finished registering engine 0:bd38ee8e-ad65-41af-944d-a9ea15162c03
> 2014-10-12 22:28:04.693 [IPControllerApp] heartbeat::missed bd38ee8e-ad65-41af-944d-a9ea15162c03 : 1
> 2014-10-12 22:28:07.693 [IPControllerApp] heartbeat::missed bd38ee8e-ad65-41af-944d-a9ea15162c03 : 2
> 2014-10-12 22:28:10.692 [IPControllerApp] heartbeat::missed bd38ee8e-ad65-41af-944d-a9ea15162c03 : 3
> 2014-10-12 22:28:13.693 [IPControllerApp] heartbeat::missed bd38ee8e-ad65-41af-944d-a9ea15162c03 : 4
> 2014-10-12 22:28:16.692 [IPControllerApp] heartbeat::missed bd38ee8e-ad65-41af-944d-a9ea15162c03 : 5
> 2014-10-12 22:28:19.692 [IPControllerApp] heartbeat::missed bd38ee8e-ad65-41af-944d-a9ea15162c03 : 6
> 2014-10-12 22:28:22.693 [IPControllerApp] heartbeat::missed bd38ee8e-ad65-41af-944d-a9ea15162c03 : 7
> 2014-10-12 22:28:25.692 [IPControllerApp] heartbeat::missed bd38ee8e-ad65-41af-944d-a9ea15162c03 : 8
> 2014-10-12 22:28:28.693 [IPControllerApp] heartbeat::missed bd38ee8e-ad65-41af-944d-a9ea15162c03 : 9
> 2014-10-12 22:28:31.692 [IPControllerApp] heartbeat::missed bd38ee8e-ad65-41af-944d-a9ea15162c03 : 10
> 2014-10-12 22:28:34.692 [IPControllerApp] heartbeat::missed bd38ee8e-ad65-41af-944d-a9ea15162c03 : 11
>
>
> So I found nothing that tells me much more. Can anyone answer some of my
> questions and/or recommend a strategy for further investigating this problem?
>
> Thank you in advance,
> Moritz
>
>
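
On the first question in the quoted message (keeping the main process alive when one engine dies), the traceback shows the exception being re-raised from ar.get() inside the result iterator. A possible workaround, sketched under assumptions: if each task is submitted on its own so that it has its own result object, the results can be collected one by one and failures set aside for resubmission. The helper below is hypothetical and only relies on each item having a get() method; it assumes the RemoteError seen above derives from Exception.

```python
def collect_tolerantly(async_results):
    """Collect results one by one instead of stopping at the first error.

    Each item only needs a get() method that either returns a value
    or raises. Returns (done, failed) as lists of (index, payload) pairs.
    """
    done, failed = [], []
    for index, result in enumerate(async_results):
        try:
            done.append((index, result.get()))
        except Exception as error:  # e.g. a remote error for a dead engine
            failed.append((index, error))
    return done, failed
```

The indices in failed identify which tasks would need to be resubmitted to the remaining engines. This does not explain why the engine died; it only stops a single failure from terminating the whole run.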