[IPython-dev] IPCluster failing when starting more than a few engines.

Drain, Theodore R (392P) theodore.r.drain at jpl.nasa.gov
Sat Mar 8 22:41:32 EST 2014


Thanks Burkhard.  I'm going to be launching about 50-200 engines, each on a remote machine, so waiting a really long time between launches isn't going to be very practical.

After beating my head against the wall for many hours on this, I've finally got what I think is a robust system for starting up the cluster using SSH.  There seem to be two issues: 1) if I launch too many engines at once, some of them will fail with either a timeout or a controller purged-request error, and 2) if an engine launches on a slow or overloaded machine, it may not finish the registration process with the controller in time and will be purged.  Issue 2) happens even if I launch a single engine at a time, and there is no easy way to fix it with the current inputs that I can see.

The first part of the fix is a modification to the HubFactory and Hub classes.  I changed the registration_timeout field, which defaults to max(5 sec, 2*heartbeat period), into a configuration file input on the HubFactory.  That lets me set it in the profile to a much larger value (90 sec in my case), which gives slow engines more time to connect without having to make the heartbeat period really long.
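For anyone curious, here is a self-contained sketch of the idea using IPython's traitlets config machinery.  The real change is a small patch against IPython.parallel.controller.hub (linked from the issue below); the class and helper names here are purely illustrative:

```python
from IPython.config.configurable import Configurable
from IPython.utils.traitlets import Integer

class HubFactorySketch(Configurable):
    # Expose the registration timeout so it can be set from the profile
    # instead of being hard-coded in hub.py.
    registration_timeout = Integer(0, config=True,
        help="Engine registration timeout in ms; 0 means use "
             "max(5000, 2 * heartbeat period) as before.")

    def effective_timeout(self, heartbeat_period_ms):
        # Fall back to the old hard-coded behaviour when unset.
        if self.registration_timeout > 0:
            return self.registration_timeout
        return max(5000, 2 * heartbeat_period_ms)
```

With a trait like that on the real HubFactory, the profile's ipcontroller_config.py can simply set c.HubFactory.registration_timeout = 90000.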

The second part of the fix is different launcher logic.  I wrote a script that launches a few engines (4 seems to work well for me) 1 second apart.  It then creates a Client and waits for all of those engines to connect, then launches more engines, and repeats until all of the engines have launched and connected.
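In rough form the loop looks like this; the launch command, batch size, and profile name below are placeholders for whatever your site actually uses (ssh to a remote host, a batch queue, etc.), so treat it as a sketch rather than my exact script:

```python
import subprocess
import time
from IPython.parallel import Client

def launch_in_batches(total, batch=4, profile='my_profile'):
    rc = Client(profile=profile)
    launched = 0
    while launched < total:
        n = min(batch, total - launched)
        for _ in range(n):
            # Replace this with however you actually start an engine.
            subprocess.Popen(['ipengine', '--profile=' + profile])
            time.sleep(1)
        launched += n
        # Block until every engine launched so far has registered with
        # the hub before starting the next batch.
        while len(rc.ids) < launched:
            time.sleep(1)
    return rc
```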

I've submitted an issue (https://github.com/ipython/ipython/issues/5302) which includes a link showing the changes to hub.py that allowed me to get this working.

If MinRK thinks it would be useful, I might try to modify the SSHEngineSetLauncher to have this kind of logic.  It could take a single integer input: 0 to keep the current behavior, or a positive value giving the number of engines to launch before waiting for them to connect.
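In the profile that might look something like this (the option name is purely hypothetical; no such trait exists today):
c.SSHEngineSetLauncher.engines_per_batch = 4   # 0 = current launch-everything-at-once behavior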

Ted


________________________________________
From: ipython-dev-bounces at scipy.org [ipython-dev-bounces at scipy.org] on behalf of Burkhard Ritter [burkhard at ualberta.ca]
Sent: Friday, March 07, 2014 6:32 PM
To: IPython developers list
Subject: Re: [IPython-dev] IPCluster failing when starting more than a few engines.

If I remember correctly, I also had difficulties bringing all engines
up reliably, and it seemed to be due to timing issues with ipcluster.
In the end I just wrote my own script to start up my engines. It does
something like this:

```
for ((i=0; i<$N; i++)); do
    # Start one engine, detached and at low priority, then pause before the next.
    nohup nice -n19 ipengine --profile=my_profile --ssh=controller_node --log-to-file &
    sleep 15
done
```

Most of the time I only have two nodes, so I just run these scripts by
hand, but it shouldn't be difficult to extend the script to start all
engines on a number of nodes.

Burkhard

On Thu, Mar 6, 2014 at 3:00 PM, Drain, Theodore R (392P)
<theodore.r.drain at jpl.nasa.gov> wrote:
> Sorry to keep spamming the list but...
>
> It appears the problem I'm having is purely timing (or timeout) based.  If I run ipengine by hand after the controller comes up, I can connect more than 5 engines (so it's not a resource problem).  I then tried hacking hub.py which has a line like this:
>
>         self.registration_timeout = max(5000, 2*self.heartmonitor.period)
>
> If I change that to 60000 (60 seconds), I can get a few more engines to connect, but it's basically guesswork as to how many make it up.  And there isn't a config option for that timeout, so that isn't much of a solution even if I could come up with a time that worked.
>
> At this point I'm thinking I'm going to have to write my own version of "ipcluster" that runs the controller, sets up the port forwards, and spawns the engines.  Perhaps if I have more control over how that happens, I can get a cluster that will reliably start up.
>
> ________________________________________
> From: ipython-dev-bounces at scipy.org [ipython-dev-bounces at scipy.org] on behalf of Drain, Theodore R (392P) [theodore.r.drain at jpl.nasa.gov]
> Sent: Thursday, March 06, 2014 12:51 PM
> To: IPython developers list
> Subject: Re: [IPython-dev] IPCluster failing when starting more than a few engines.
>
> One further bit of information: I'm hitting a hard limit of 5 engines connecting using SSH port forwarding.  I can run any number of engines locally and it works fine.  Could there be some kind of ZMQ limit or SSH limit?  The host machine does spawn a huge number of processes - I count 33 processes created when running ipcluster start with a single remote engine, which seems a little excessive.
>
> ________________________________________
> From: ipython-dev-bounces at scipy.org [ipython-dev-bounces at scipy.org] on behalf of Drain, Theodore R (392P) [theodore.r.drain at jpl.nasa.gov]
> Sent: Thursday, March 06, 2014 12:37 PM
> To: IPython developers list
> Subject: Re: [IPython-dev] IPCluster failing when starting more than a few engines.
>
> Here's some more information.  Hopefully someone can help with this as this problem basically makes IPython parallel unusable.
>
> I had our SAs disable the firewall, and then things work fine.  All the engines start up and connect.  With the firewall on, I have to add "--enginessh=host" to the controller_args input to enable SSH port forwarding for the connections.  When I do that, if I try to launch a single engine on 30 separate computers (with a shared file system), I can only connect 5 of them, even though the ipcluster log reports that they all connected fine.
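> For context, that controller_args setting lives in the cluster profile config; roughly like the lines below (LocalControllerLauncher is just one possibility, use whichever controller launcher your profile is actually configured with):
> # LocalControllerLauncher shown for illustration; pick your controller launcher class.
> c.LocalControllerLauncher.controller_args = ['--enginessh=host', '--log-to-file']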
>
> I'm wondering if there is some timing issue w/ running that many SSH port forward calls (it looks like 3 ports per engine are set up).
>
> Any thoughts on what I could try to fix this?
>
> Ted
>
> ________________________________________
> From: ipython-dev-bounces at scipy.org [ipython-dev-bounces at scipy.org] on behalf of Drain, Theodore R (392P) [theodore.r.drain at jpl.nasa.gov]
> Sent: Wednesday, March 05, 2014 11:06 AM
> To: IPython developers list
> Subject: [IPython-dev] IPCluster failing when starting more than a few engines.
>
> Using IPython 2.0.0 dev branch sync'ed on 2014-02-24 11:44:52.  Running ipcluster start on a set of machines without a shared file system, using SSHEngineSetLauncher.  I have 6 machines with between 4 and 12 cores each.  If I run ipcluster with 2 engines/machine, it works fine.  If I increase it to 3 or higher, I start getting engines that fail to connect.
>
> Some failures are failures to connect that look like this:
> 10:43:30.195 [IPEngineApp] Registering with controller at tcp://x.x.x.x:59987
> 10:43:35.196 [IPEngineApp] CRITICAL | Registration timed out after 5.0 seconds
>
> Other failures are weirder:
> 10:43:30.184 [IPEngineApp] Registering with controller at tcp://x.x.x.x:59987
> 10:43:30.249 [IPEngineApp] Starting to monitor the heartbeat signal from the hub every 3010 ms.
> 10:43:30.251 [IPEngineApp] Using existing profile dir: u'.ipython/profile_dev'
> 10:43:30.252 [IPEngineApp] Completed registration with id 6
> 10:43:36.273 [IPEngineApp] WARNING | No heartbeat in the last 3010 ms (1 time(s) in a row).
> 10:43:39.281 [IPEngineApp] WARNING | No heartbeat in the last 3010 ms (2 time(s) in a row).
> 10:43:42.293 [IPEngineApp] WARNING | No heartbeat in the last 3010 ms (3 time(s) in a row).
> 10:44:36.469 [IPEngineApp] WARNING | No heartbeat in the last 3010 ms (1 time(s) in a row).
> 10:44:42.489 [IPEngineApp] WARNING | No heartbeat in the last 3010 ms (1 time(s) in a row).
>
> In the second case, if I connect a client to the controller, there is no engine with ID 6 available, even though it seems to be getting some heartbeats from the hub.
>
> I've tried adding lines like these to my config file and it doesn't help:
> c.IPClusterStart.delay = 0.5
> c.SSHEngineSetLauncher.delay = 0.5
>
> The number of failures increases with the number of engines being started on each machine.  Trying to start 12 engines on a single machine is almost a complete failure.
>
> Any thoughts on what I should be doing differently?
>
> Thanks,
> Ted
>
_______________________________________________
IPython-dev mailing list
IPython-dev at scipy.org
http://mail.scipy.org/mailman/listinfo/ipython-dev
