[IPython-dev] Using parallel on a particular cluster
Jon Wilson
jsw at fnal.gov
Thu Dec 4 11:56:57 EST 2014
Hi,
I would very much like to start using IPython.parallel on our cluster.
However, I'm having a hard time getting going. Let me explain the
structure of the cluster (as well as I understand it), and my guess
about where and how things should be run and structured. Then maybe
somebody can help me fill in the gaps.
I have a machine that is owned by our research group, located on campus.
I have root on this machine, and can install and run anything I like
(within reason). No limits on time, CPU, or memory usage beyond the
limits of the hardware itself.
The cluster has two login nodes, login01 and login02. Processes on the
login nodes are killed after about half an hour. These machines are
visible from outside the cluster. Key-pair ssh authentication is
disabled for some reason, and I think it would be quite a fight to get
it enabled, even just for my account.
The cluster uses SLURM for scheduling, and jobs are submitted to a
handful of queues from the login nodes.
The cluster has lots of compute nodes. Processes that are idle for too
long are killed.
Network access to most machines outside the cluster is prohibited. It
seems that some special cases are whitelisted. You can talk to the
login nodes, and a few other sites around the country. Connections to
machines that are on campus but not part of the cluster, including our
research group's machines, are NOT in general permitted.
I do have IPython installed on the cluster, via anaconda.
So, obviously I want to have engines running on the compute nodes. I
want the notebook server and the primary kernel to run on our group's
machine, occasionally spinning up some engines and submitting stuff to
them, as needed. Then once the heavy lifting is complete and the
results returned to the primary kernel, I would shut the engines down
again and relinquish those compute nodes.
I think that the hub and the schedulers should also run on our group's
machine. The login node would be the obvious choice, but long-running
processes are killed there, so I don't think that it will work.
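Concretely, what I picture is starting the controller on our group's
machine with something like this (the profile name is just what I would
call it, and --location would be our machine's hostname):

```shell
# On our group's machine: start the hub and schedulers, listening on
# all interfaces so the (tunneled) engines can reach them, and reusing
# the same connection files across restarts.
ipcontroller --profile=slurm --ip='*' --reuse
```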
The engines need to be able to talk to the hub and to the schedulers. I
suppose that ssh tunnels are probably the best way to do this. Since
our group's machine can't see the compute nodes, and the compute nodes
aren't allowed to talk to our group's machine, I think I will have to
request an exception be made to allow them to talk to our group's
machine. I hope that this will be granted.
Assuming that it is, the first thing the engines should do is to
establish ssh tunnels (in both directions) to our machine.
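Roughly, I picture each engine doing something like the following before
it starts (the port numbers and hostname here are placeholders; the real
ports would come from the ipcontroller-engine.json connection file that
the controller writes):

```shell
# On each compute node: forward the controller's ports (registration
# plus the task/control channels) back to our group's machine.
# Ports and hostname below are made up for illustration.
ssh -N -f \
    -L 55321:localhost:55321 \
    -L 55322:localhost:55322 \
    me@group-machine.example.edu
```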
Here is where I get a bit lost. I don't know how to configure things.
When I "start" the cluster, I guess the hub and schedulers can just
start locally, so that's probably easy to configure. To start the
engines, I need to ssh to the login node and run "srun" with an
appropriate SLURM script. This ssh needs manual intervention: my
password. I'm guessing that this is harder to configure. And where
will my password be requested? In the console where the notebook server
is running?
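For the SLURM side, I'm imagining a batch script along these lines (the
profile path, job name, and resource numbers are placeholders, and the
connection file would have to be copied to the cluster beforehand):

```shell
#!/bin/bash
#SBATCH --job-name=ipengines
#SBATCH --ntasks=16
#SBATCH --time=04:00:00
# Each task starts one engine, pointed at the connection file that
# ipcontroller wrote on our group's machine.
srun ipengine --file="$HOME/.ipython/profile_slurm/security/ipcontroller-engine.json"
```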
Then, after waiting in the queue, the engines start one by one. They
make the ssh tunnels, and then what: do they attempt to contact the hub
and schedulers?
Do the hub and schedulers wait for all the engines to become available
before the cluster is ready to use? Or can I start submitting work to
engines as soon as they come online?
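My current understanding (please correct me if I'm wrong) is that
engines register with the hub one by one as they start, and a Client can
use whichever engines are registered at that moment. If so, I could poll
for a minimum engine count with a small helper like this
(wait_for_engines is my own name, not part of the API):

```python
import time

def wait_for_engines(client, n_min, timeout=600, poll=5):
    """Block until at least n_min engines have registered with the hub.

    `client` is assumed to be an IPython.parallel Client (or anything
    with an `.ids` list of registered engine ids).
    """
    deadline = time.time() + timeout
    while len(client.ids) < n_min:
        if time.time() >= deadline:
            raise TimeoutError(
                "only %d of %d engines registered after %ss"
                % (len(client.ids), n_min, timeout))
        time.sleep(poll)
    return list(client.ids)
```

Then I'd call it right after connecting, e.g.
`wait_for_engines(Client(profile='slurm'), 16)`, before building a
load-balanced view and submitting work.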
This looks like a really useful tool, but I'm struggling to figure out
how to start using it.
Regards,
Jon