[IPython-dev] Using IPython Cluster with SGE -- help needed
Matthieu Brucher
matthieu.brucher at gmail.com
Mon Aug 5 10:02:47 EDT 2013
Hi,
I don't know why the registration was not complete. Is your home
folder the same on all nodes and on the login node?
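As a quick sanity check (just a sketch; I'm taking the profile name
from your log below), you can ask the controller how many engines
actually managed to register:

    from IPython.parallel import Client

    # connects using the client connection file of the given profile
    rc = Client(profile='nexus_py2.7')
    print(len(rc.ids))  # number of currently registered engines

If that prints 0, the engines never reached the controller.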
You won't see 12 jobs. You asked for 12 engines, and they are all
submitted as a single job; the 12 engines are then started by mpiexec
-n 12. This is the standard way of using batch schedulers: ask for
some cores, then run an MPI application on those cores.
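For illustration, a minimal SGE script following that pattern could
look like the sketch below (the parallel environment name 'mpi' and
the application name are assumptions; your site will have its own):

    #!/bin/bash
    #$ -S /bin/bash
    #$ -pe mpi 12   # request 12 slots in one job (PE name is site-specific)
    #$ -cwd
    # one qsub submission, twelve processes started by MPI:
    mpiexec -n 12 ./my_mpi_app

One submission therefore shows up as a single line in qstat, no
matter how many processes it starts.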
You can also try to submit additional engines now that the controller
is up and running. Check that the connection files (in the profile's
security directory) are present and readable from the compute nodes.
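As a sketch (again assuming your profile name), engines can be added
to a running controller with:

    ipcluster engines --profile=nexus_py2.7 -n 4

The engines read ipcontroller-engine.json from the profile's security
directory, so that file has to be visible from every compute node.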
Cheers,
2013/8/5 Andreas Hilboll <lists at hilboll.de>:
> Am 04.08.2013 16:20, schrieb Matthieu Brucher:
>> Hi,
>>
>> I guess we may want to start with the ipython documentation on this
>> topic: http://ipython.org/ipython-doc/stable/parallel/parallel_process.html
>>
>> Cheers,
>>
>> 2013/8/4 Andreas Hilboll <lists at hilboll.de>:
>>> Hi,
>>>
>>> I would like to use IPython for calculations on our cluster. It's a
>>> total of 11 compute + 1 management nodes (all running Linux), and we're
>>> using SGE's qsub to submit jobs. The $HOME directory is shared via NFS
>>> between all the nodes.
>>>
>>> Even after reading the documentation, I'm unsure about how to get things
>>> running. I assume that I'll have to execute ``ipcluster -n 16`` on all
>>> compute nodes (they have 16 cores each). I'd have the ipython shell
>>> (notebook won't work due to firewall restrictions I cannot change) on
>>> the management node. But how does the management node know about the
>>> kernels which are running on the compute nodes and waiting for a job?
>>> And how can I tell the management node that it shall use qsub to submit
>>> the jobs to the individual kernels?
>>>
>>> As I think this is a common use case, I'd be willing to write up a nice
>>> tutorial about the setup, but I fear I need some help from you guys to
>>> get things running ...
>>>
>>> Cheers,
>>>
>>> -- Andreas.
>>
>>
>>
>
> Okay, thanks to the good docs, I was able to start a cluster:
>
> (test_py27)hilboll@login:~> ipcluster start --profile=nexus_py2.7 -n 12
> 2013-08-05 15:26:04.264 [IPClusterStart] Using existing profile dir:
> u'/gpfs/hb/hilboll/.config/ipython/profile_nexus_py2.7'
> 2013-08-05 15:26:04.272 [IPClusterStart] Starting ipcluster with
> [daemon=False]
> 2013-08-05 15:26:04.273 [IPClusterStart] Creating pid file:
> /gpfs/hb/hilboll/.config/ipython/profile_nexus_py2.7/pid/ipcluster.pid
> 2013-08-05 15:26:04.273 [IPClusterStart] Starting Controller with
> SGEControllerLauncher
> 2013-08-05 15:26:04.289 [IPClusterStart] Job submitted with job id: '60'
> 2013-08-05 15:26:05.289 [IPClusterStart] Starting 12 Engines with
> SGEEngineSetLauncher
> 2013-08-05 15:26:05.306 [IPClusterStart] Job submitted with job id: '61'
> 2013-08-05 15:26:35.351 [IPClusterStart] Engines appear to have started
> successfully
>
> However, using qstat, I can only see one job in the queue, which is the
> controller:
>
> hilboll@login:~> qstat
> job-ID  prior    name     user     state  submit/start at      queue                slots  ja-task-ID
> ------------------------------------------------------------------------------------------------------
>     60  0.57500  ipython  hilboll  r      08/05/2013 15:26:06  all.q@login.cluster      1
>
>
> I used the following job template:
>
> c.SGEEngineSetLauncher.batch_template = '''#!/bin/bash
> #$ -N ipython             #- job name (optional!)
> #$ -q all.q               #- use the queue 'all.q'
> #$ -S /bin/bash           #- required!
> #$ -V                     #- use the same paths as the current shell
> #$ -j y                   #- merge STDOUT and STDERR
> #$ -o log_ipython_{n}.log
>
> source /hb/hilboll/local/anaconda/bin/activate test_py27
> mpiexec -n {n} ipengine --profile-dir={profile_dir}
> '''
>
> If I use a 'blank' ``ipengine --profile-dir={profile_dir}`` instead of
> the mpiexec call, I get exactly two jobs in the queue, one for the
> controller and one for the first engine.
>
> My naive understanding would be that exactly {n} jobs get submitted via
> the SGEEngineSetLauncher. Is my expectation wrong?
>
> In the logfile, I get this here, 12 times:
>
> 2013-08-05 15:26:09.038 [IPEngineApp] Registration timed out after 2.0
> seconds
>
> Any help resolving this issue is greatly appreciated :)
>
> Cheers,
>
> -- Andreas.
--
Information System Engineer, Ph.D.
Blog: http://matt.eifelle.com
LinkedIn: http://www.linkedin.com/in/matthieubrucher
Music band: http://liliejay.com/