[IPython-dev] Problems starting parallel nodes with ssh launch
John.Holt at tessella.com
Tue May 5 11:56:36 EDT 2015
Hi,
I am working on a project which is utilising what was IPython.parallel, and I have been running into unexpected problems on startup of the nodes from the workbook. The problems seem to split into two distinct groups:
1) Nodes not completing the initial registration process; the log reads:
2015-04-30 15:58:43.435 [IPEngineApp] Loading url_file u'.ipython/profile_default/security/ipcontroller-engine.json'
2015-04-30 15:58:43.455 [IPEngineApp] Registering with controller at tcp://192.168.3.18:42527
2) Nodes starting but failing to get through the registration, log reads:
2015-04-30 15:58:59.514 [IPEngineApp] Loading url_file u'.ipython/profile_default/security/ipcontroller-engine.json'
2015-04-30 15:58:59.530 [IPEngineApp] Registering with controller at tcp://192.168.3.18:42527
2015-04-30 15:59:03.793 [IPEngineApp] complete registration started
In both cases the nodes don't automatically terminate themselves. The problem can be worked around by making the delay between launching nodes larger. I did a bit of digging and it appears that the first problem is an exception raised inside zmq's ssh tunnel code, reached via connect(), which is called from register(). The exception is:
2015-04-30 15:58:43.760 [IPEngineApp] ERROR | problem with connect
Traceback (most recent call last):
  File "/var/share/jupyter/virtualenv_2.7/src/ipython/IPython/parallel/engine/engine.py", line 147, in register
    connect(reg, self.url)
  File "/var/share/jupyter/virtualenv_2.7/src/ipython/IPython/parallel/engine/engine.py", line 116, in connect
    password=password,
  File "/var/share/jupyter/virtualenv_2.7/lib/python2.7/site-packages/zmq/ssh/tunnel.py", line 134, in tunnel_connection
    new_url, tunnel = open_tunnel(addr, server, keyfile=keyfile, password=password, paramiko=paramiko, timeout=timeout)
  File "/var/share/jupyter/virtualenv_2.7/lib/python2.7/site-packages/zmq/ssh/tunnel.py", line 162, in open_tunnel
    tunnel = tunnelf(lport, rport, server, remoteip=ip, keyfile=keyfile, password=password, timeout=timeout)
  File "/var/share/jupyter/virtualenv_2.7/lib/python2.7/site-packages/zmq/ssh/tunnel.py", line 240, in openssh_tunnel
    raise RuntimeError("tunnel '%s' failed to start"%(cmd))
RuntimeError: tunnel 'ssh -i ~/.ssh/id_rsa -f -S none -L 127.0.0.1:53305:192.168.3.18:42527 192.168.3.18 sleep 60' failed to start
This is probably caused by running multiple engines on my node (it is a multi-core node). If one engine starts while another is still connecting, there is a race condition: the first engine connects to a port after the second engine has determined that the port is free, so when the second engine tries to connect it finds that the port is no longer free. This exception is not caught and logged (or at least I cannot find a log); I added an exception catch to get the above output. Would it be possible to add a general exception catch and log to the register function? The next problem is that, because the abort timer is not created until after register is called, the process never exits but instead sits doing nothing.
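A minimal sketch of the suspected race and one way to tolerate it, assuming the local tunnel port is chosen with a check-then-use pattern. The names pick_free_port, open_with_retry and start_tunnel are hypothetical and are not part of IPython or pyzmq; start_tunnel stands in for whatever actually launches the ssh tunnel and is assumed to raise RuntimeError on failure, as in the traceback above.

```python
import socket
import time


def pick_free_port():
    """Ask the OS for a local port that is unused right now."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind(("127.0.0.1", 0))  # port 0 lets the OS choose a free port
    port = sock.getsockname()[1]
    sock.close()
    # From here on the port is free again, so another engine on the same
    # machine can take it before the caller uses it. This is the race window.
    return port


def open_with_retry(start_tunnel, attempts=5, delay=0.5):
    """Call start_tunnel(port) with a fresh port until it succeeds.

    start_tunnel is a hypothetical callable that raises RuntimeError
    when the tunnel cannot be started on the given port.
    """
    last_error = None
    for attempt in range(attempts):
        port = pick_free_port()
        try:
            return start_tunnel(port)
        except RuntimeError as exc:
            last_error = exc
            # Back off a little longer after each failure.
            time.sleep(delay * (attempt + 1))
    raise RuntimeError(
        "tunnel failed after %d attempts: %s" % (attempts, last_error)
    )
```

Retrying with a newly chosen port does not remove the race, it only makes a collision between engines on the same node much less likely to be fatal.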
The second problem is more mysterious to me because I did less debugging so I am unsure what is going on; I would guess it is something similar. The complete_registration function is crashing at some point. I placed logging points through the function and it seemed to exit/hang at a number of points:
1) launching the heartbeat
2) creating the Shell Connections
3) creating the control stream
The abort process is unregistered at the top of the complete_registration function, so this process never exits. I am unsure whether this is throwing an exception (I may be able to look into this if it is important).
So to summarise, I think it would be great to have exception catching and logging around both the register and complete_registration functions. It would also be good to make sure that the abort loop is started before register and stopped at the end of complete_registration, so that if an error does occur (including it just spinning) then the process will exit. However, I may have misinterpreted the code, so please let me know if I am doing something incorrect.
Thank you for your help.
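The ordering proposed above could look roughly like the sketch below. It is not the engine's code: it uses threading.Timer in place of the engine's own event loop, and run_registration, register, complete_registration and on_abort are hypothetical stand-ins passed in as plain callables.

```python
import logging
import threading

log = logging.getLogger("engine_sketch")


def run_registration(register, complete_registration, timeout, on_abort):
    """Run both registration steps under a single abort timer.

    on_abort is a hypothetical callback that fires if the steps have not
    finished within `timeout` seconds; in a real engine it would end the
    process.
    """
    # Arm the abort timer before any registration work starts, so a hang
    # anywhere in register() or complete_registration() is covered.
    watchdog = threading.Timer(timeout, on_abort)
    watchdog.daemon = True
    watchdog.start()
    try:
        reply = register()
        complete_registration(reply)
        return reply
    except Exception:
        # Log the full traceback instead of failing silently, then let
        # the caller decide how to shut down.
        log.exception("registration failed")
        raise
    finally:
        # Disarm the timer only once both steps have finished or failed.
        watchdog.cancel()
```

With this ordering a hang is handled by the timer and an exception is handled by the logged re-raise, so neither case leaves an idle process behind.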
John