[IPython-dev] IPCluster failing when starting more than a few engines.
Drain, Theodore R (392P)
theodore.r.drain at jpl.nasa.gov
Wed Mar 5 14:06:00 EST 2014
Using the IPython 2.0.0 dev branch, synced on 2014-02-24 11:44:52. I'm running ipcluster start on a set of machines without a shared file system, using SSHEngineSetLauncher. I have 6 machines with between 4 and 12 cores each. If I run ipcluster with 2 engines per machine, it works fine. If I increase that to 3 or more, I start getting engines that fail to connect.
Some failures are registration timeouts that look like this:
10:43:30.195 [IPEngineApp] Registering with controller at tcp://x.x.x.x:59987
10:43:35.196 [IPEngineApp] CRITICAL | Registration timed out after 5.0 seconds
Other failures are weirder:
10:43:30.184 [IPEngineApp] Registering with controller at tcp://x.x.x.x:59987
10:43:30.249 [IPEngineApp] Starting to monitor the heartbeat signal from the hub every 3010 ms.
10:43:30.251 [IPEngineApp] Using existing profile dir: u'.ipython/profile_dev'
10:43:30.252 [IPEngineApp] Completed registration with id 6
10:43:36.273 [IPEngineApp] WARNING | No heartbeat in the last 3010 ms (1 time(s) in a row).
10:43:39.281 [IPEngineApp] WARNING | No heartbeat in the last 3010 ms (2 time(s) in a row).
10:43:42.293 [IPEngineApp] WARNING | No heartbeat in the last 3010 ms (3 time(s) in a row).
10:44:36.469 [IPEngineApp] WARNING | No heartbeat in the last 3010 ms (1 time(s) in a row).
10:44:42.489 [IPEngineApp] WARNING | No heartbeat in the last 3010 ms (1 time(s) in a row).
In the second case, if I connect a client to the controller, there is no engine with ID 6 available, even though the engine seems to be getting some heartbeats from the hub.
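To tell the two failure modes apart across many engine logs, something like the following might help. It is only a rough sketch that assumes the log line format shown above (timestamp, [IPEngineApp], optional level, message). The function name classify_engine_log and the event labels are made up for illustration and are not part of IPython.

```python
import re
from collections import Counter

# Assumed line format, taken from the excerpts above:
#   10:43:35.196 [IPEngineApp] CRITICAL | Registration timed out after 5.0 seconds
LINE_RE = re.compile(
    r"^(\d{2}:\d{2}:\d{2}\.\d{3}) \[IPEngineApp\] (?:(\w+) \| )?(.*)$"
)


def classify_engine_log(lines):
    """Count notable events in one engine's log (hypothetical helper)."""
    events = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip lines that do not follow the assumed format
        _timestamp, _level, message = match.groups()
        if message.startswith("Registration timed out"):
            events["registration_timeout"] += 1
        elif message.startswith("Completed registration"):
            events["registered"] += 1
        elif message.startswith("No heartbeat"):
            events["missed_heartbeat"] += 1
    return events


if __name__ == "__main__":
    sample = [
        "10:43:30.195 [IPEngineApp] Registering with controller at tcp://x.x.x.x:59987",
        "10:43:35.196 [IPEngineApp] CRITICAL | Registration timed out after 5.0 seconds",
    ]
    print(dict(classify_engine_log(sample)))
```

An engine whose log shows "registered" together with repeated "missed_heartbeat" events corresponds to the second case; one with only "registration_timeout" corresponds to the first.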
I've tried adding lines like these to my config file and it doesn't help:
c.IPClusterStart.delay = 0.5
c.SSHEngineSetLauncher.delay = 0.5
The number of failures increases with the number of engines being started on each machine. Trying to start 12 engines on a single machine is almost a complete failure.
Any thoughts on what I should be doing differently?
Thanks,
Ted