Dear all,

I'm having a lot of trouble setting up IPython parallel on a PBS cluster, and I would really appreciate anybody helping.

The architecture is a standard PBS cluster - a head node with slave nodes. I connect to the head node from my laptop over ssh.

The client (laptop) -> Head node connection seems simple enough. The problem is the engines.

Ignoring the laptop for a moment, I'll just focus on running ipython on the head node, with the engines on a slave node. I assume this is a correct method of working?

I did the following on the head node, following instructions at http://ipython.org/ipython-doc/stable/parallel/parallel_process.html#using-i...:

$ ipython profile create --parallel --profile=pbs

Files are as follows:

$ cat ipcluster_config.py
c = get_config()
c.IPClusterStart.controller_launcher_class = 'PBSControllerLauncher'
c.IPClusterEngines.engine_launcher_class = 'PBSEngineSetLauncher'
c.PBSLauncher.queue = 'long'
c.IPClusterEngines.n = 2 # Run 2 cores on 1 node or 2 nodes with all cores? Not sure.

$ cat ipengine_config.py
c = get_config()

Then execute on the head node:

$ ipcluster start --profile=pbs -n 2
2013-09-10 15:02:46,771.771 [IPClusterStart] Using existing profile dir: u'/home/username/.ipython/profile_pbs'
2013-09-10 15:02:46.777 [IPClusterStart] Starting ipcluster with [daemon=False]
2013-09-10 15:02:46.778 [IPClusterStart] Creating pid file: /home/username/.ipython/profile_pbs/pid/ipcluster.pid
2013-09-10 15:02:46.778 [IPClusterStart] Starting Controller with PBSControllerLauncher
2013-09-10 15:02:46.792 [IPClusterStart] Job submitted with job id: '2830'
2013-09-10 15:02:47.793 [IPClusterStart] Starting 2 Engines with PBSEngineSetLauncher
2013-09-10 15:02:47.808 [IPClusterStart] Job submitted with job id: '2831'

Then the queue shows

$ qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
2830[].master             ipcontroller     username               0 Q long
2831[].master             ipengine         username               0 Q long

And they just hang there, queued forever. I assume the engines at least should be running? Full information through "qstat -f" doesn't give the reason for the queuing. Normally it would do. There are more than 4 nodes available.

$ qstat -f
Job Id: 2831[].master.domain
    Job_Name = ipengine
    Job_Owner = username@master.domain
    job_state = Q
    queue = long
    server = [head node's domain address]
    Checkpoint = u
    ctime = Tue Sep 10 15:02:47 2013
    Error_Path = master.domain:/home/username/ipengine.e2831
    Hold_Types = n
    Join_Path = n
    Keep_Files = n
    Mail_Points = a
    mtime = Tue Sep 10 15:02:47 2013
    Output_Path = master.domain:/home/username/ipengine.o2831
    Priority = 0
    qtime = Tue Sep 10 15:02:47 2013
    Rerunable = True
    [...]
    etime = Tue Sep 10 15:02:47 2013
    submit_args = ./pbs_engines
    job_array_request = 1-2
    fault_tolerant = False
    submit_host = master.domain
    init_work_dir = /home/username

It also seems strange that the ipcontroller is launched through PBS. I thought this should be on the head node, so I changed 'PBSControllerLauncher' to 'LocalControllerLauncher'. Then it doesn't queue, but I don't know if what I'm doing is correct.

Any help would be really greatly appreciated.

Thank you.

James
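A note on reading the "qstat -f" output above: when comparing several queued jobs, it can help to turn the "key = value" listing into a dictionary. The following is only a minimal sketch, not part of IPython or PBS. The function name parse_qstat_full is hypothetical, the sample text is a shortened copy of the output in the message, and the idea that a "comment" attribute may carry a scheduler note is an assumption that holds only on some PBS installations. Wrapped continuation lines are ignored in this sketch.

```python
def parse_qstat_full(text):
    """Parse `qstat -f` style text into {job_id: {attribute: value}}.

    Hypothetical helper for illustration; lines that are not of the
    form "key = value" (for example wrapped continuation lines) are skipped.
    """
    jobs = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("Job Id:"):
            # A new job record starts here.
            job_id = line.split(":", 1)[1].strip()
            current = jobs.setdefault(job_id, {})
        elif current is not None and " = " in line:
            key, value = line.split(" = ", 1)
            current[key.strip()] = value.strip()
    return jobs


# Shortened copy of the output quoted in the message.
sample = """\
Job Id: 2831[].master.domain
    Job_Name = ipengine
    Job_Owner = username@master.domain
    job_state = Q
    queue = long
    submit_args = ./pbs_engines
    job_array_request = 1-2
"""

jobs = parse_qstat_full(sample)
for job_id, attrs in jobs.items():
    # Assumption: some PBS variants record a scheduler note under "comment".
    note = attrs.get("comment", "(no comment attribute)")
    print(job_id, attrs.get("job_state"), note)
```

On a real cluster the text could be captured with subprocess.check_output(["qstat", "-f"]) and decoded before being passed to the function; whether the scheduler records a reason for the queued state depends on the local PBS setup.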
Nobody knows?

Alternatively: Can anyone suggest a better place to ask? Maybe the mailing-list activity here is a bit low.

Thank you.

James
On Fri, Sep 13, 2013 at 8:50 AM, James <jamesresearching@gmail.com> wrote:
> Nobody knows?
>
> Alternatively: Can anyone suggest a better place to ask? Maybe the mailing-list activity here is a bit low.

The best place for IPython questions would be the ipython-dev mailing list.

http://mail.scipy.org/mailman/listinfo/ipython-dev

IPython's website also has a link to a chat room where the devs hang out. They also pay attention to StackOverflow questions tagged "ipython".

http://ipython.org/

--
Robert Kern
Thanks a lot for the pointer to the ipython-dev mailing list. I will try there.

Best regards,

James
_______________________________________________
SciPy-User mailing list
SciPy-User@scipy.org
http://mail.scipy.org/mailman/listinfo/scipy-user