[IPython-dev] IPython parallel and PBS

Fri Sep 13 06:38:30 EDT 2013

Dear all,

I'm having a lot of trouble setting up IPython parallel on a PBS cluster,
and I would really appreciate any help.

The architecture is a standard PBS cluster - a head node with slave nodes.
I connect to the head node from my laptop over ssh.

The client (laptop) -> Head node connection seems simple enough. The
problem is the engines.

Ignoring the laptop for a moment, I'll just focus on running ipython on the
head node, with the engines on a slave node. I assume this is a correct
method of working?

I did the following on the head node, following instructions at
http://ipython.org/ipython-doc/stable/parallel/parallel_process.html#using-ipcluster-in-pbs-mode:

$ ipython profile create --parallel --profile=pbs

Files are as follows:

$ cat ipcluster_config.py
c = get_config()
c.IPClusterStart.controller_launcher_class = 'PBSControllerLauncher'
c.IPClusterEngines.engine_launcher_class = 'PBSEngineSetLauncher'
c.PBSLauncher.queue = 'long'
c.IPClusterEngines.n = 2  # Run 2 cores on 1 node, or 2 nodes with all cores? Not sure.

$ cat ipengine_config.py
c = get_config()

Then execute on the head node:
$ ipcluster start --profile=pbs -n 2
2013-09-10 15:02:46,771.771 [IPClusterStart] Using existing profile dir: u'/home/username/.ipython/profile_pbs'
2013-09-10 15:02:46.777 [IPClusterStart] Starting ipcluster with [daemon=False]
2013-09-10 15:02:46.778 [IPClusterStart] Creating pid file: /home/username/.ipython/profile_pbs/pid/ipcluster.pid
2013-09-10 15:02:46.778 [IPClusterStart] Starting Controller with PBSControllerLauncher
2013-09-10 15:02:46.792 [IPClusterStart] Job submitted with job id: '2830'
2013-09-10 15:02:47.793 [IPClusterStart] Starting 2 Engines with PBSEngineSetLauncher
2013-09-10 15:02:47.808 [IPClusterStart] Job submitted with job id: '2831'

Then the queue shows
$ qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
2830[].master             ipcontroller     username               0 Q long
2831[].master             ipengine         username               0 Q long

And they just hang there, queued forever. I assume the engines at least
should be running? The full output of "qstat -f" doesn't give a reason for
the queuing, although normally it would. There are more than 4 nodes
available.

$ qstat -f
Job Id: 2831[].master.domain
    Job_Name = ipengine
    Job_Owner = username@master.domain
    job_state = Q
    queue = long
    server = [head node's domain address]
    Checkpoint = u
    ctime = Tue Sep 10 15:02:47 2013
    Error_Path = master.domain:/home/username/ipengine.e2831
    Hold_Types = n
    Join_Path = n
    Keep_Files = n
    Mail_Points = a
    mtime = Tue Sep 10 15:02:47 2013
    Output_Path = master.domain:/home/username/ipengine.o2831
    Priority = 0
    qtime = Tue Sep 10 15:02:47 2013
    Rerunable = True
    [...]
    etime = Tue Sep 10 15:02:47 2013
    submit_args = ./pbs_engines
    job_array_request = 1-2
    fault_tolerant = False
    submit_host = master.domain
    init_work_dir = /home/username

It also seems strange that the ipcontroller is launched through PBS. I
thought this should be on the head node, so I changed
'PBSControllerLauncher' to 'LocalControllerLauncher'. Then it doesn't
queue, but I don't know if what I'm doing is correct.
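
For reference, a minimal sketch of what that changed configuration looks like. It reuses only the launcher names and options quoted earlier in this message. The commented-out c.HubFactory.ip lines are an assumption, not something from the steps above: engines running on other nodes have to reach the controller over the network, so the controller may need to listen on more than the loopback interface. Whether a local controller is the right setup for this cluster is still the open question.

```python
# ipcluster_config.py (in profile_pbs)
c = get_config()
# Run the controller directly on the head node instead of submitting it to PBS
c.IPClusterStart.controller_launcher_class = 'LocalControllerLauncher'
# Engines are still submitted as a PBS job
c.IPClusterEngines.engine_launcher_class = 'PBSEngineSetLauncher'
c.PBSLauncher.queue = 'long'
c.IPClusterEngines.n = 2

# ipcontroller_config.py (in profile_pbs) is a separate file.
# Assumption, see the note above:
# c = get_config()
# c.HubFactory.ip = '*'  # listen on all interfaces so remote engines can connect
```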

Any help would be greatly appreciated.

Thank you.

James