[IPython-dev] Using IPython Cluster with SGE -- help needed

Andreas Hilboll lists at hilboll.de
Mon Aug 5 09:45:08 EDT 2013


On 04.08.2013 16:20, Matthieu Brucher wrote:
> Hi,
> 
> I guess we may want to start with the ipython documentation on this
> topic: http://ipython.org/ipython-doc/stable/parallel/parallel_process.html
> 
> Cheers,
> 
> 2013/8/4 Andreas Hilboll <lists at hilboll.de>:
>> Hi,
>>
>> I would like to use IPython for calculations on our cluster. It has
>> 11 compute nodes plus 1 management node (all running Linux), and we
>> use SGE's qsub to submit jobs. The $HOME directory is shared between
>> all the nodes via NFS.
>>
>> Even after reading the documentation, I'm unsure how to get things
>> running. I assume I'll have to execute ``ipcluster -n 16`` on all
>> compute nodes (they have 16 cores each). I'd run the IPython shell
>> (the notebook won't work due to firewall restrictions I cannot
>> change) on the management node. But how does the management node
>> learn about the kernels running on the compute nodes and waiting for
>> work? And how can I tell the management node to use qsub to submit
>> jobs to the individual kernels?
>>
>> As I think this is a common use case, I'd be willing to write up a nice
>> tutorial about the setup, but I fear I need some help from you guys to
>> get things running ...
>>
>> Cheers,
>>
>> -- Andreas.

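Following the docs Matthieu pointed to, I selected the SGE launchers in
my profile's ipcluster_config.py. Roughly like this (typed from memory,
so treat it as a sketch rather than a verbatim copy; the launcher class
names match what the log below prints):

c = get_config()

# launch both the controller and the engine set via SGE's qsub
c.IPClusterStart.controller_launcher_class = 'SGEControllerLauncher'
c.IPClusterEngines.engine_launcher_class = 'SGEEngineSetLauncher'
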
Okay, thanks to the good docs, I was able to start a cluster:

(test_py27)hilboll@login:~> ipcluster start --profile=nexus_py2.7 -n 12
2013-08-05 15:26:04.264 [IPClusterStart] Using existing profile dir:
u'/gpfs/hb/hilboll/.config/ipython/profile_nexus_py2.7'
2013-08-05 15:26:04.272 [IPClusterStart] Starting ipcluster with
[daemon=False]
2013-08-05 15:26:04.273 [IPClusterStart] Creating pid file:
/gpfs/hb/hilboll/.config/ipython/profile_nexus_py2.7/pid/ipcluster.pid
2013-08-05 15:26:04.273 [IPClusterStart] Starting Controller with
SGEControllerLauncher
2013-08-05 15:26:04.289 [IPClusterStart] Job submitted with job id: '60'
2013-08-05 15:26:05.289 [IPClusterStart] Starting 12 Engines with
SGEEngineSetLauncher
2013-08-05 15:26:05.306 [IPClusterStart] Job submitted with job id: '61'
2013-08-05 15:26:35.351 [IPClusterStart] Engines appear to have started
successfully
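
The plan is then to connect from a plain IPython shell on the
management node -- standard IPython.parallel usage, if I understand the
docs correctly:

from IPython.parallel import Client

rc = Client(profile='nexus_py2.7')
print(rc.ids)   # should list one id per engine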

However, using qstat, I can only see one job in the queue, which is the
controller:

hilboll@login:~> qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
     60 0.57500 ipython    hilboll      r     08/05/2013 15:26:06 all.q@login.cluster                1


I used the following job template:

c.SGEEngineSetLauncher.batch_template = '''#!/bin/bash
#$ -N ipython               #- job name (optional)
#$ -q all.q                 #- use the queue 'all.q'
#$ -S /bin/bash             #- required!
#$ -V                       #- export the current shell's environment
#$ -j y                     #- merge STDOUT and STDERR
#$ -o log_ipython_{n}.log

source /hb/hilboll/local/anaconda/bin/activate test_py27
mpiexec -n {n} ipengine --profile-dir={profile_dir}
'''

If I use a 'blank' ``ipengine --profile-dir={profile_dir}`` instead of
the mpiexec call, I get exactly two jobs in the queue, one for the
controller and one for the first engine.

My naive expectation was that the SGEEngineSetLauncher would submit
exactly {n} jobs. Is that expectation wrong?

In the log file, I get the following message, 12 times:

2013-08-05 15:26:09.038 [IPEngineApp] Registration timed out after 2.0
seconds
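
If this is just the engines racing ahead of the controller's connection
files on the shared filesystem, I could presumably raise the
registration timeout in the profile's ipengine_config.py -- I'm only
guessing that c.EngineFactory.timeout is the right knob:

c = get_config()

# guess on my part: wait longer than the default 2 s for registration
c.EngineFactory.timeout = 30.0

But that wouldn't explain why qstat shows only a single job.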

Any help resolving this issue is greatly appreciated :)

Cheers,

-- Andreas.


