[IPython-dev] Parallel SSH questions

Drain, Theodore R (392P) theodore.r.drain at jpl.nasa.gov
Tue Jan 14 18:46:36 EST 2014


I've got some questions about how SSH engines work in the parallel system (running IPython 1.1.0).  This work is done on a cluster with a shared file system (hosts node1...node#), and all of the ports are firewalled, so I have to use SSH tunnels.  I've created a profile named "ssh", and my ipcluster_config.py looks like this:

import platform; host = platform.node()

# Launch the controller over SSH on this host, and record the SSH
# server in the connection files so engines/clients can tunnel in.
c.IPClusterStart.controller_launcher_class = 'SSHControllerLauncher'
c.SSHControllerLauncher.hostname = host
c.SSHControllerLauncher.controller_args = [
    '--enginessh=%s' % host, '--ssh=%s' % host,
    '--log-to-file', '--log-level=20']

# Launch two engines each on node2 and node3 over SSH.
c.IPClusterStart.engine_launcher_class = 'SSHEngineSetLauncher'
c.SSHEngineSetLauncher.engine_args = [
    '--profile=ssh', '--log-to-file', '--log-level=20']
c.SSHEngineSetLauncher.engines = {
    'node2': 2,
    'node3': 2,
}
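
For reference, I'm launching all of this with the stock ipcluster
command:

ipcluster start --profile=ssh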


Question #1:  If I don't add the --enginessh argument to the controller, the ssh field in ipcontroller-engine.json is never filled in and the engines won't start.  If I don't add the --ssh argument, I can't connect to the controller from a remote client because the ssh field in ipcontroller-client.json is blank.  Adding the hostname lookup shown above solves this.  Is this the expected behavior?  It seems like a bit of a hack that shouldn't be required to get forwarding to work - is there a better way to handle this?
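
For what it's worth, this is how I've been checking whether the field
gets written (a sketch; the path assumes the default IPYTHONDIR layout
and my profile named "ssh"):

import json
import os

# Inspect the engine connection file to see whether the ssh field
# was filled in; it stays empty unless --enginessh is passed.
path = os.path.expanduser(
    '~/.ipython/profile_ssh/security/ipcontroller-engine.json')
with open(path) as f:
    info = json.load(f)
print(info.get('ssh'))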

Question #2: Is there any way to map engine ID (or UUID) to host name?  Since everything is connected with SSH port forwarding, I can see how the controller might not know this, but it would make for much nicer status output.  I'm using the db_query() routine to report how many jobs are completed and waiting for each engine, and if there are engines that are really slow, I'd like to report them by host so the user can track down what's going on.
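
I realize I could ask each engine for its hostname myself - something
like this sketch (assumes all engines are up and registered):

import socket
from IPython.parallel import Client

# Ask every engine to run gethostname and build an id -> host map.
rc = Client(profile='ssh')
dv = rc[:]
hostnames = dv.apply_sync(socket.gethostname)
hostmap = dict(zip(rc.ids, hostnames))  # e.g. {0: 'node2', 1: 'node2', ...}

But I'd rather get this from the controller so it also works for
engines that are busy or unresponsive - which is exactly the case I
care about.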

Question #3: In my environment (lots of people using the same cluster), parallel jobs will run at a high nice level so they don't preempt interactive users.  This can mean that if another user starts jobs on a node, the parallel tasks assigned to that node might take a long time to finish.  I'd like to have a controller that handles this.  One idea is a controller that reassigns pending jobs to other engines once all other jobs are completed (and accepts the result from whichever engine finishes first).  Has anyone done anything like this, or have tips on where to start?
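
On the client side I'm imagining something along these lines (a rough
sketch; it assumes tasks are idempotent, and it abandons rather than
cancels the losing duplicates):

import time
from IPython.parallel import Client

def first_result(rc, func, args=(), n_copies=2, poll=0.5):
    # Submit the same (idempotent) task to n_copies engines and
    # return whichever result arrives first; the losing engines
    # still run their copies to completion.
    views = [rc[i] for i in rc.ids[:n_copies]]
    pending = [v.apply_async(func, *args) for v in views]
    while True:
        for ar in pending:
            if ar.ready():
                return ar.get()
        time.sleep(poll)

Ideally the scheduler would do this only for tasks that have been
pending for a long time, rather than duplicating everything up front.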

Thanks,
Ted

