[IPython-dev] Do I need to write a new parallel engine launcher class?
Drain, Theodore R (392P)
theodore.r.drain at jpl.nasa.gov
Tue Jan 14 13:31:19 EST 2014
I'm providing software to a group of users to make it easier for them to run simple parallel jobs. The environment is a cluster of machines with a network storage system (shared file system). I can use SSHEngineSetLauncher to launch a collection of engines on the cluster nodes and everything works fine, but I'd like to handle the case where an engine can't be started for whatever reason. Currently, the caller gets an error message that's hard to understand, and engine spawning stops, leaving a partial set of engines running.
What I'd like to provide them is a command line tool (pengines) that does something like this:
host0> pengines start --profile=cluster
Starting controller on host0
Starting engine 0 on host1
Starting engine 1 on host1
Starting engine 2 on host2 - FAILED
See log /somepath/tolog/file
Starting engine 3 on host3
Starting engine 4 on host4
Finished
4 engines available
host0> pengines status
Controller running on host0
4 engines available
host0> pengines stop
Stopping 4 engines...
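The command layout in the session above could be parsed with argparse subcommands. This is only a sketch of the command-line surface: the name `build_parser` and the fallback profile name "default" are assumptions made for illustration, and nothing here talks to IPython yet.

```python
import argparse


def build_parser():
    """Build a parser for the hypothetical `pengines` tool."""
    parser = argparse.ArgumentParser(prog="pengines")
    sub = parser.add_subparsers(dest="command")
    sub.required = True  # a subcommand must always be given
    for name in ("start", "status", "stop"):
        cmd = sub.add_parser(name)
        # Assumption: every subcommand accepts --profile and falls back
        # to a profile called "default" when it is omitted.
        cmd.add_argument("--profile", default="default")
    return parser


# Parse the first command from the session above.
args = build_parser().parse_args(["start", "--profile=cluster"])
print(args.command, args.profile)
```

Each subcommand would then dispatch to its own function that starts, queries, or stops the controller and engines.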
The problem I'm running into is that the existing launchers dump a lot of text to the screen, and the engine set launcher stops if a single engine can't be started. I'm looking for suggestions on the best way to "fix" this. I think I need to write a new launcher, similar to SSHEngineSetLauncher, that handles errors and provides much simpler output. Any suggestions would be appreciated.
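To make the desired behaviour concrete, here is a minimal sketch of the "keep going on failure" loop. It deliberately does not use the IPython launcher classes: `start_engines` and `ssh_launch` are hypothetical names, and the idea that an engine which exits within a couple of seconds counts as a failed start is an assumption, not how SSHEngineSetLauncher detects errors.

```python
import subprocess
import time


def ssh_launch(host, profile="cluster", grace=2.0):
    """Start ipengine on `host` over ssh.

    Assumption: an engine that exits within `grace` seconds failed to start.
    """
    proc = subprocess.Popen(
        ["ssh", host, "ipengine", "--profile=" + profile],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    time.sleep(grace)
    if proc.poll() is not None:
        raise RuntimeError("ipengine exited with code %d" % proc.returncode)
    return proc


def start_engines(hosts, launch=ssh_launch):
    """Try to start one engine per entry in `hosts`; never stop on a failure."""
    running, failed = [], []
    for index, host in enumerate(hosts):
        try:
            handle = launch(host)
        except Exception as exc:
            # Record the failure, print one short line, and move on.
            failed.append((index, host, exc))
            print("Starting engine %d on %s - FAILED" % (index, host))
        else:
            running.append((index, host, handle))
            print("Starting engine %d on %s" % (index, host))
    print("Finished")
    print("%d engines available" % len(running))
    return running, failed
```

Passing the launch function as a parameter keeps the loop independent of how an engine is actually started, so the same error handling could wrap an ssh command, a batch-system submission, or a single-engine launcher object.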
Thanks,
Ted