[IPython-dev] Ipython parallel and PBS
MinRK
benjaminrk at gmail.com
Fri Sep 13 14:14:17 EDT 2013
Can you inspect the pbs_engines template and see if anything looks wrong?
Can you submit it manually, with qsub ./pbs_engines?
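
If it helps, ipcluster should write the generated script as ./pbs_engines in
the directory you started it from, so you can just cat it there. You can also
spell the template out yourself in ipcluster_config.py via the batch_template
setting. A minimal sketch, assuming a Torque-style scheduler - the queue,
walltime, and nodes/ppn lines below are placeholders to adapt to your site:

c.PBSEngineSetLauncher.batch_template = """#!/bin/sh
#PBS -V
#PBS -N ipengine
#PBS -q {queue}
#PBS -t 1-{n}
#PBS -l nodes=1:ppn=1
#PBS -l walltime=24:00:00
cd $PBS_O_WORKDIR
ipengine --profile-dir={profile_dir}
"""

The {n}, {queue}, and {profile_dir} fields are filled in by the launcher when
it writes the script; with the job-array line, each of the n array tasks
starts one engine.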
On Fri, Sep 13, 2013 at 3:38 AM, James <jamesresearching at gmail.com> wrote:
> Dear all,
>
> I'm having a lot of trouble setting up IPython parallel on a PBS cluster,
> and I would really appreciate any help.
>
> The architecture is a standard PBS cluster - a head node with slave nodes.
> I connect to the head node from my laptop over ssh.
>
> The client (laptop) -> head node connection seems simple enough. The
> problem is the engines.
>
> Ignoring the laptop for a moment, I'll just focus on running IPython on
> the head node, with the engines on a slave node. I assume this is a correct
> way of working?
>
> I did the following on the head node, following instructions at
> http://ipython.org/ipython-doc/stable/parallel/parallel_process.html#using-ipcluster-in-pbs-mode:
>
> $ ipython profile create --parallel --profile=pbs
>
> Files are as follows:
>
> $ cat ipcluster_config.py
> c = get_config()
> c.IPClusterStart.controller_launcher_class = 'PBSControllerLauncher'
> c.IPClusterEngines.engine_launcher_class = 'PBSEngineSetLauncher'
> c.PBSLauncher.queue = 'long'
> c.IPClusterEngines.n = 2  # Run 2 cores on 1 node, or 2 nodes with all cores? Not sure.
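
(Side note on n: as far as I understand it, n is just the total number of
engines ipcluster will try to start; whether they share one node or spread
across several is decided by the resource request in the PBS template, not by
this setting. With a job-array template like the sketch above, each array
task asks for a single slot, e.g. "#PBS -l nodes=1:ppn=1", and runs one
ipengine, so the scheduler decides the placement.)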
>
> $ cat ipengine_config.py
> c = get_config()
>
> Then execute on the head node:
> $ ipcluster start --profile=pbs -n 2
> 2013-09-10 15:02:46.771 [IPClusterStart] Using existing profile dir:
> u'/home/username/.ipython/profile_pbs'
> 2013-09-10 15:02:46.777 [IPClusterStart] Starting ipcluster with
> [daemon=False]
> 2013-09-10 15:02:46.778 [IPClusterStart] Creating pid file:
> /home/username/.ipython/profile_pbs/pid/ipcluster.pid
> 2013-09-10 15:02:46.778 [IPClusterStart] Starting Controller with
> PBSControllerLauncher
> 2013-09-10 15:02:46.792 [IPClusterStart] Job submitted with job id: '2830'
> 2013-09-10 15:02:47.793 [IPClusterStart] Starting 2 Engines with
> PBSEngineSetLauncher
> 2013-09-10 15:02:47.808 [IPClusterStart] Job submitted with job id: '2831'
>
> Then the queue shows
> $ qstat
> Job id                    Name             User            Time Use S Queue
> ------------------------- ---------------- --------------- -------- - -----
> 2830[].master             ipcontroller     username               0 Q long
> 2831[].master             ipengine         username               0 Q long
>
> And they just hang there, queued forever. I assume at least the engines
> should be running? The full output from "qstat -f" doesn't give the reason
> for the queuing, as it normally would. There are more than 4 nodes
> available.
>
> $ qstat -f
> Job Id: 2831[].master.domain
> Job_Name = ipengine
> Job_Owner = username at master.domain
> job_state = Q
> queue = long
> server = [head node's domain address]
> Checkpoint = u
> ctime = Tue Sep 10 15:02:47 2013
>     Error_Path = master.domain:/home/username/ipengine.e2831
> Hold_Types = n
> Join_Path = n
> Keep_Files = n
> Mail_Points = a
> mtime = Tue Sep 10 15:02:47 2013
> Output_Path = master.domain:/home/username/ipengine.o2831
> Priority = 0
> qtime = Tue Sep 10 15:02:47 2013
> Rerunable = True
> [...]
> etime = Tue Sep 10 15:02:47 2013
> submit_args = ./pbs_engines
> job_array_request = 1-2
> fault_tolerant = False
> submit_host = master.domain
> init_work_dir = /home/username
>
> It also seems strange that the ipcontroller is launched through PBS. I
> thought it should run on the head node, so I changed 'PBSControllerLauncher'
> to 'LocalControllerLauncher'. With that, the controller no longer goes
> through the queue, but I don't know whether what I'm doing is correct.
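
(For what it's worth, that combination should be fine: LocalControllerLauncher
starts ipcontroller directly on the machine where you run ipcluster, i.e. the
head node here, while PBSEngineSetLauncher submits only the engines through
the queue:

c.IPClusterStart.controller_launcher_class = 'LocalControllerLauncher'
c.IPClusterEngines.engine_launcher_class = 'PBSEngineSetLauncher'

The one thing to double-check, if I remember right, is that the controller
listens on an address the compute nodes can reach, for instance by setting
c.HubFactory.ip = '*' in ipcontroller_config.py instead of the default
loopback.)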
>
> Any help would be really greatly appreciated.
>
> Thank you.
>
> James
>
> _______________________________________________
> IPython-dev mailing list
> IPython-dev at scipy.org
> http://mail.scipy.org/mailman/listinfo/ipython-dev
>
>