[IPython-dev] Problems starting parallel nodes with ssh launch

John.Holt at tessella.com John.Holt at tessella.com
Tue May 5 11:56:36 EDT 2015


Hi,

I am working on a project that uses what was IPython.parallel, and I have been running into unexpected problems when starting the nodes from the workbook. The problems seem to split into two distinct groups:
1) Nodes not completing the initial registration process; the log reads:
    2015-04-30 15:58:43.435 [IPEngineApp] Loading url_file u'.ipython/profile_default/security/ipcontroller-engine.json'
    2015-04-30 15:58:43.455 [IPEngineApp] Registering with controller at tcp://192.168.3.18:42527

2) Nodes starting but failing to get through the registration, log reads:
    2015-04-30 15:58:59.514 [IPEngineApp] Loading url_file u'.ipython/profile_default/security/ipcontroller-engine.json'
    2015-04-30 15:58:59.530 [IPEngineApp] Registering with controller at tcp://192.168.3.18:42527
    2015-04-30 15:59:03.793 [IPEngineApp] complete registration started

In both cases the nodes don't automatically terminate themselves. The problem can be avoided by making the delay between launching nodes larger. I did a bit of digging, and it appears that the first problem is an exception thrown in connect() (within zmq), which is called from register(). The exception is:

    2015-04-30 15:58:43.760 [IPEngineApp] ERROR | problem with connect
    Traceback (most recent call last):
      File "/var/share/jupyter/virtualenv_2.7/src/ipython/IPython/parallel/engine/engine.py", line 147, in register
        connect(reg, self.url)
      File "/var/share/jupyter/virtualenv_2.7/src/ipython/IPython/parallel/engine/engine.py", line 116, in connect
        password=password,
      File "/var/share/jupyter/virtualenv_2.7/lib/python2.7/site-packages/zmq/ssh/tunnel.py", line 134, in tunnel_connection
        new_url, tunnel = open_tunnel(addr, server, keyfile=keyfile, password=password, paramiko=paramiko, timeout=timeout)
      File "/var/share/jupyter/virtualenv_2.7/lib/python2.7/site-packages/zmq/ssh/tunnel.py", line 162, in open_tunnel
        tunnel = tunnelf(lport, rport, server, remoteip=ip, keyfile=keyfile, password=password, timeout=timeout)
      File "/var/share/jupyter/virtualenv_2.7/lib/python2.7/site-packages/zmq/ssh/tunnel.py", line 240, in openssh_tunnel
        raise RuntimeError("tunnel '%s' failed to start"%(cmd))
    RuntimeError: tunnel 'ssh -i ~/.ssh/id_rsa -f -S none -L 127.0.0.1:53305:192.168.3.18:42527 192.168.3.18 sleep 60' failed to start

This is probably caused by running multiple engines on my node (it is a multi-core node). If one engine starts before another has finished connecting, there is a race condition: the first engine connects to a port after the second engine has determined that the port is free, so when the second engine tries to connect it finds that the port is no longer free. This exception is not caught and logged (or at least I cannot find a log); I added an exception catch to get the above output. Would it be possible to add a general exception catch and log to the register function? The next problem is that, because the abort timer is not created until after register is called, the process never exits but instead sits doing nothing.
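To illustrate the kind of race I mean, here is a minimal sketch using only the standard library. It assumes the local tunnel port is chosen with a check-then-use pattern (ask the OS for an unused port, release it, bind it again later); I have not confirmed that this is exactly what zmq does, and the function names are made up for the example.

```python
import socket


def pick_free_port(host="127.0.0.1"):
    """Ask the OS for an unused port, then release it again (check-then-use)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind((host, 0))              # port 0: let the OS choose
        return s.getsockname()[1]      # the port is released when the socket closes


def try_bind(port, host="127.0.0.1"):
    """Return a listening socket on the port, or None if it is already taken."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.bind((host, port))
        s.listen(1)
        return s
    except OSError:
        s.close()
        return None


def demonstrate_race():
    """Replay the interleaving sequentially: B checks, A binds, B binds."""
    port = pick_free_port()    # engine B decides this port is free
    first = try_bind(port)     # engine A takes the port in the meantime
    second = try_bind(port)    # engine B now tries to use "its" port
    try:
        return first is not None, second is None
    finally:
        for sock in (first, second):
            if sock is not None:
                sock.close()
```

Calling demonstrate_race() shows the second bind failing even though the port was reported free a moment earlier, which matches what I see when several engines start close together on the same node.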

The second problem is more mysterious to me: I did less debugging, so I am unsure what is going on, but I would guess it is something similar. The complete_registration function is crashing at some point. I placed logging points through the function, and it seemed to exit/hang at a number of points:
1) launching the heartbeat
2) creating the Shell Connections
3) creating the control stream

The abort process is unregistered at the top of the complete_registration function, so this process never exits. I am unsure whether this is throwing an exception (I may be able to look into this if it is important).

So to summarise, I think it would be great to have exception catching and logging around both the register and complete_registration functions. It would also be good to make sure that the abort loop is started before register and stopped at the end of complete_registration, so that if an error does occur (including it just spinning) the process will exit. However, I may have misinterpreted the code, so please let me know if I am doing something incorrect.
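Roughly, the structure I have in mind is sketched below. This is not the IPython engine code: GuardedRegistration, connect, finish and on_abort are hypothetical names for this example, and it uses threading.Timer as a stand-in for whatever timer the engine actually uses.

```python
import logging
import threading

log = logging.getLogger("engine_sketch")


class GuardedRegistration(object):
    """Hypothetical stand-in for the engine's two registration steps."""

    def __init__(self, connect, finish, on_abort, timeout=5.0):
        self.connect = connect        # stands in for the body of register()
        self.finish = finish          # stands in for complete_registration()
        self.on_abort = on_abort      # whatever terminates the process
        self.timeout = timeout
        self.aborted = threading.Event()
        self._lock = threading.Lock()
        self._abort_started = False
        self._abort_timer = None

    def _abort(self, reason):
        with self._lock:
            if self._abort_started:   # abort at most once
                return
            self._abort_started = True
        if self._abort_timer is not None:
            self._abort_timer.cancel()
        log.error("registration aborted: %s", reason)
        self.on_abort()
        self.aborted.set()

    def register(self):
        # Arm the watchdog before anything that can raise or hang.
        self._abort_timer = threading.Timer(
            self.timeout, self._abort, args=("timed out",))
        self._abort_timer.daemon = True
        self._abort_timer.start()
        try:
            self.connect()
        except Exception:
            log.exception("problem with connect")
            self._abort("register failed")

    def complete_registration(self):
        try:
            self.finish()
        except Exception:
            log.exception("problem completing registration")
            self._abort("complete_registration failed")
            return
        # Disarm only once every step has succeeded.
        self._abort_timer.cancel()
```

With this ordering, an exception in either step is logged and triggers the abort, and a step that simply hangs is still caught by the timer because it is only cancelled on the last line of complete_registration.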

Thank you for your help.

John




