[IPython-dev] Using parallel on a particular cluster
Jon Wilson
jsw at fnal.gov
Thu Dec 4 11:56:57 EST 2014
Hi,
I would very much like to start using IPython.parallel on our cluster.
However, I'm having a hard time getting going. Let me explain the
structure of the cluster (as well as I understand it), and my guess
about where and how things should be run and structured. Then maybe
somebody can help me fill in the gaps.
I have a machine that is owned by our research group, located on campus.
I have root on this machine, and can install and run anything I like
(within reason). No limits on time, CPU, or memory usage beyond the
limits of the hardware itself.
The cluster has two login nodes, login01 and login02. Processes on the
login nodes are killed after about half an hour. These machines are
visible from outside the cluster. Key-pair ssh authentication is
disabled for some reason, and I think it would be quite a fight to get
it enabled, even just for my account.
The cluster uses SLURM for scheduling, and jobs are submitted to a
handful of queues from the login nodes.
The cluster has lots of compute nodes. Processes that are idle for too
long are killed.
Network access to most machines outside the cluster is prohibited. It
seems that some special cases are whitelisted. You can talk to the
login nodes, and a few other sites around the country. Connections to
machines that are on campus but not part of the cluster, including our
research group's machines, are NOT in general permitted.
I do have IPython installed on the cluster, via anaconda.
So, obviously I want to have engines running on the compute nodes. I
want the notebook server and the primary kernel to run on our group's
machine, occasionally spinning up some engines and submitting stuff to
them, as needed. Then once the heavy lifting is complete and the
results returned to the primary kernel, I would shut the engines down
again and relinquish those compute nodes.
I think that the hub and the schedulers should also run on our group's
machine. The login node would be the obvious choice, but long-running
processes are killed there, so I don't think that it will work.
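Concretely, what I picture is starting the controller on our group's
machine with something like this (the profile name is just what I would
call it, and --location would be our machine's hostname):

```shell
# On our group's machine: start the hub and schedulers, listening on
# all interfaces so the (tunneled) engines can reach them, and reusing
# the same connection files across restarts.
ipcontroller --profile=slurm --ip='*' --reuse
```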
The engines need to be able to talk to the hub and to the schedulers. I
suppose that ssh tunnels are probably the best way to do this. Since
our group's machine can't see the compute nodes, and the compute nodes
aren't allowed to talk to our group's machine, I think I will have to
request an exception be made to allow them to talk to our group's
machine. I hope that this will be granted.
Assuming that it is, the first thing the engines should do is to
establish ssh tunnels (in both directions) to our machine.
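Roughly, I picture each engine doing something like the following before
it starts (the port numbers and hostname here are placeholders; the real
ports would come from the ipcontroller-engine.json connection file that
the controller writes):

```shell
# On each compute node: forward the controller's ports (registration
# plus the task/control channels) back to our group's machine.
# Ports and hostname below are made up for illustration.
ssh -N -f \
    -L 55321:localhost:55321 \
    -L 55322:localhost:55322 \
    me@group-machine.example.edu
```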
Here is where I get a bit lost. I don't know how to configure things.
When I "start" the cluster, I guess the hub and schedulers can just
start locally, so that's probably easy to configure. To start the
engines, I need to ssh to the login node and run "srun" with an
appropriate SLURM script. This ssh needs manual intervention: my
password. I'm guessing that this is harder to configure. And where
will my password be requested? In the console where the notebook server
is running?
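For the SLURM side, I'm imagining a batch script along these lines (the
profile path, job name, and resource numbers are placeholders, and the
connection file would have to be copied to the cluster beforehand):

```shell
#!/bin/bash
#SBATCH --job-name=ipengines
#SBATCH --ntasks=16
#SBATCH --time=04:00:00
# Each task starts one engine, pointed at the connection file that
# ipcontroller wrote on our group's machine.
srun ipengine --file="$HOME/.ipython/profile_slurm/security/ipcontroller-engine.json"
```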
Then, after waiting in the queue, the engines start one by one. They
make the ssh tunnels, and then what: do they attempt to contact the hub
and schedulers?
Do the hub and schedulers wait for all the engines to become available
before the cluster is ready to use? Or can I start submitting work to
engines as soon as they come online?
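My current understanding (please correct me if I'm wrong) is that
engines register with the hub one by one as they start, and a Client can
use whichever engines are registered at that moment. If so, I could poll
for a minimum engine count with a small helper like this
(wait_for_engines is my own name, not part of the API):

```python
import time

def wait_for_engines(client, n_min, timeout=600, poll=5):
    """Block until at least n_min engines have registered with the hub.

    `client` is assumed to be an IPython.parallel Client (or anything
    with an `.ids` list of registered engine ids).
    """
    deadline = time.time() + timeout
    while len(client.ids) < n_min:
        if time.time() >= deadline:
            raise TimeoutError(
                "only %d of %d engines registered after %ss"
                % (len(client.ids), n_min, timeout))
        time.sleep(poll)
    return list(client.ids)
```

Then I'd call it right after connecting, e.g.
`wait_for_engines(Client(profile='slurm'), 16)`, before building a
load-balanced view and submitting work.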
This looks like a really useful tool, but I'm struggling to figure out
how to start using it.
Regards,
Jon