[IPython-dev] Ipython parallel and PBS
James
jamesresearching at gmail.com
Fri Sep 13 06:38:30 EDT 2013
Dear all,
I'm having a lot of trouble setting up IPython parallel on a PBS cluster,
and I would really appreciate any help.
The architecture is a standard PBS cluster - a head node with slave nodes.
I connect to the head node from my laptop over ssh.
The client (laptop) -> head node connection seems simple enough; the problem is
the engines.
Ignoring the laptop for a moment, I'll just focus on running IPython on the
head node, with the engines on a slave node. I assume this is a correct way
of working?
I did the following on the head node, following instructions at
http://ipython.org/ipython-doc/stable/parallel/parallel_process.html#using-ipcluster-in-pbs-mode:
$ ipython profile create --parallel --profile=pbs
Files are as follows:
$ cat ipcluster_config.py
c = get_config()
c.IPClusterStart.controller_launcher_class = 'PBSControllerLauncher'
c.IPClusterEngines.engine_launcher_class = 'PBSEngineSetLauncher'
c.PBSLauncher.queue = 'long'
c.IPClusterEngines.n = 2  # Run 2 cores on 1 node, or 2 nodes with all cores? Not sure.
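(As far as I can tell from the docs, n is just substituted as {n} into the
launcher's batch template, and the default engine template submits a job array,
which would match the [] job ids below; the node/core layout is whatever the
template asks for. If I wanted it explicit, I think I could point the launchers
at my own templates, roughly like the untested sketch below, where the template
file names, walltime, ppn and the mpiexec line are my guesses for our cluster:

# additions to ipcluster_config.py (sketch, untested)
c.PBSControllerLauncher.batch_template_file = 'pbs.controller.template'
c.PBSEngineSetLauncher.batch_template_file = 'pbs.engine.template'

$ cat pbs.engine.template
#PBS -N ipengine
#PBS -j oe
#PBS -q {queue}
#PBS -l walltime=01:00:00
#PBS -l nodes=1:ppn={n}
cd $PBS_O_WORKDIR
mpiexec -n {n} ipengine --profile-dir={profile_dir}

But I haven't tried this; I'd first like to understand why even the defaults
don't run.)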
$ cat ipengine_config.py
c = get_config()
Then execute on the head node:
$ ipcluster start --profile=pbs -n 2
2013-09-10 15:02:46.771 [IPClusterStart] Using existing profile dir: u'/home/username/.ipython/profile_pbs'
2013-09-10 15:02:46.777 [IPClusterStart] Starting ipcluster with [daemon=False]
2013-09-10 15:02:46.778 [IPClusterStart] Creating pid file: /home/username/.ipython/profile_pbs/pid/ipcluster.pid
2013-09-10 15:02:46.778 [IPClusterStart] Starting Controller with PBSControllerLauncher
2013-09-10 15:02:46.792 [IPClusterStart] Job submitted with job id: '2830'
2013-09-10 15:02:47.793 [IPClusterStart] Starting 2 Engines with PBSEngineSetLauncher
2013-09-10 15:02:47.808 [IPClusterStart] Job submitted with job id: '2831'
Then the queue shows
$ qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
2830[].master             ipcontroller     username               0 Q long
2831[].master             ipengine         username               0 Q long
And they just hang there, queued forever. I assume at least the engines should
be running? The full information from "qstat -f" doesn't give a reason for the
queuing, which it normally would. There are more than four nodes available, so
it shouldn't be waiting on resources.
$ qstat -f
Job Id: 2831[].master.domain
Job_Name = ipengine
Job_Owner = username at master.domain
job_state = Q
queue = long
server = [head node's domain address]
Checkpoint = u
ctime = Tue Sep 10 15:02:47 2013
Error_Path = master.domain:/home/username/ipengine.e2831
Hold_Types = n
Join_Path = n
Keep_Files = n
Mail_Points = a
mtime = Tue Sep 10 15:02:47 2013
Output_Path = master.domain:/home/username/ipengine.o2831
Priority = 0
qtime = Tue Sep 10 15:02:47 2013
Rerunable = True
[...]
etime = Tue Sep 10 15:02:47 2013
submit_args = ./pbs_engines
job_array_request = 1-2
fault_tolerant = False
submit_host = master.domain
init_work_dir = /home/username
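Since submit_args shows the launcher wrote the script as ./pbs_engines in my
home directory, I can at least look at what was actually requested and check
the queue itself; these are just my guesses at next diagnostic steps
(Torque-style commands):

$ cat ~/pbs_engines    # the batch script the launcher actually submitted
$ qstat -Qf long       # limits and state of the 'long' queue
$ pbsnodes -a          # node availability

If anyone knows what to look for in those, I'd be grateful.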
It also seems strange that the ipcontroller is launched through PBS. I thought
it should run on the head node, so I changed 'PBSControllerLauncher' to
'LocalControllerLauncher'. Then the controller doesn't go through the queue,
but I don't know if that is the correct approach.
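One thing I'm unsure about with LocalControllerLauncher: I believe the engines
on the compute nodes (and later my laptop) then have to reach the controller
over the network rather than loopback, so I would also need something like this
in ipcontroller_config.py (untested, just my understanding):

$ cat ipcontroller_config.py
c = get_config()
# listen on all interfaces so engines on other nodes can connect
c.HubFactory.ip = '*'    # or the head node's cluster-facing IP address

Is that right?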
Any help would be really greatly appreciated.
Thank you.
James