[IPython-dev] 0.11rc1 : problem with tutorial for PBS in http://ipython.org/ipython-doc/dev/parallel/parallel_process.html
MinRK
benjaminrk at gmail.com
Mon Jul 4 19:56:13 EDT 2011
On Mon, Jul 4, 2011 at 13:01, Johann Cohen-Tanugi
<johann.cohentanugi at gmail.com> wrote:
> good evening.... still trying to make the PBS batch parallel code work.
> I had to comment the "-t" line in launcher.py, but I am still puzzled by
> the fact that there is no loop over n to start n different engines. Is
> that because the '-t' was precisely there to create an array of subjobs?
Sorry about this; I was testing on Linux with Torque, which
supports job arrays via '-t'. How to start a collection of jobs is
going to vary from one PBS flavor to another.
For instance, on some systems the default will be to use *no* job
array, and to run multiple engines via aprun or mpiexec. I just looked,
and the addition of the job-array line is unconditional, which is
definitely wrong. I just pushed a fix for that, so if you specify your
own template, it should not be changed at all by IPython.
A possible example:

#!/bin/sh
#PBS -N ipython
#PBS -j oe
#PBS -l walltime=00:10:00
# the next line assumes 4-CPU nodes
#PBS -l nodes={n/4}:ppn=4
#PBS -q {queue}
cd $PBS_O_WORKDIR
aprun -n {n} ipengine profile_dir={profile_dir}
This assumes your system uses aprun; mpiexec may be more
appropriate for you. Note that if you have PBS but no parallel
launcher such as mpiexec or aprun available, then you may have to do
something like a simple loop:
for i in {{1..{n}}}; do
    ipengine profile_dir={profile_dir} &
done
wait
Note the doubled braces: the templates are filled in with Python's
str.format-style substitution, so to get a literal brace in the output
you need to double it. That is,
for i in {{1..{n}}}
becomes
for i in {1..4}
which bash then expands to 1 2 3 4.
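If you want to see the escaping in action, here is a quick check in
plain Python (nothing IPython-specific, just str.format on the loop
header):

# brace escaping with str.format: doubled braces survive as literals
template = "for i in {{1..{n}}}; do"
print(template.format(n=4))
# prints: for i in {1..4}; do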
The templating should support any of these.
>
> Second question, more general: assuming the use of ipcluster, a
> controller and several engines are created; following the tutorial, all
> would actually run in batch, which seems strange to me for the
> controller: batch queues usually have time limits, and it is
> unavoidable that engines would die when the CPU time is exceeded, but I
> do not see why the controller should suffer from this. What would be the
> rationale for executing the controller in batch rather than locally? Another
> question: once the engines run in batch, I presume that they listen to
> commands sent from any ipython session that I would interactively start,
> provided I use the Client() with the correct permissions in terms of
> ports, ssh etc. Is that correct, i.e. is that indeed the idea?
You can choose whether or not to start the controller with batch; that's
up to you. There is no coupling at all between which launcher you choose
for the Controller and which you choose for the Engines. If you want
the controller to live longer than the batch system will allow, then
using the batch launcher for the controller is obviously the wrong
choice.
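For concreteness, here is a sketch of the relevant bits of
ipcluster_config.py that submit the engines through PBS while keeping
the controller as a plain local process. The option and class names
below are from my memory of the 0.11 launcher module, so treat them as
an assumption and check them against your install:

# in <profile_dir>/ipcluster_config.py  (sketch; verify names for your version)
c = get_config()

# submit the engines through PBS...
c.IPClusterEngines.engine_launcher = \
    'IPython.parallel.apps.launcher.PBSEngineSetLauncher'

# ...but run the controller as a local process on this machine
c.IPClusterStart.controller_launcher = \
    'IPython.parallel.apps.launcher.LocalControllerLauncher'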
It does sometimes make sense to launch the controller with batch,
because you could have an entire job with controller, engines, and
clients *all* submitted via the batch system. I do this sometimes
with SGE for scaling tests using StarCluster. It also helps
load-balancing on shared-node systems, since the controller then
takes up a work slot of its own. If you have a high-throughput
workload, you don't want to run n engines + a controller on an
n-cpu node, because they will be fighting over resources.
Two more reasons to start the controller with batch: it will be
faster, and you can turn off your local machine. You can submit a
million jobs, turn off your local machine altogether, then connect
again later and retrieve your results. And it will be faster because
the majority of the communication happens between the controller and
the engines, rather than between the client and the controller.
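The disconnect-and-resume workflow looks roughly like this (a sketch
only: 'pbs' is the profile from this thread, f is a throwaway example
function, and saved_msg_ids stands for whatever ids you recorded in the
first session; results remain retrievable as long as the controller is
still running and holding them in its hub database):

In [1]: from IPython.parallel import Client
In [2]: rc = Client(profile='pbs')
In [3]: lview = rc.load_balanced_view()
In [4]: def f(x):
   ...:     return x ** 2
   ...:
In [5]: ar = lview.map(f, range(1000))
In [6]: ar.msg_ids        # record these, e.g. write them to a file
# ... log out, shut down your laptop, come back later ...
In [1]: rc = Client(profile='pbs')
In [2]: rc.get_result(saved_msg_ids).get()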
-MinRK
>
> sorry to be dense about all that... I think it would be useful if the
> batch doc page was supplemented with the final step which amounts to
> starting an interactive ipython session and connecting to the batch engines.
Sure, I can add this, though it's no different from any other method of
connecting to a controller from another machine.
If you are on the same system (e.g. in a batch script or on a login
node), it will amount to:
ipython
In [1]: from IPython.parallel import Client
In [2]: rc = Client(profile='clusterprofile')
And if you are not, you will have to get the ipcontroller-client.json
file from profile_dir/security with scp, and do:
In [2]: rc = Client('/path/to/ipcontroller-client.json')
possibly adding `sshserver='loginnode.example.com'` if you didn't
specify the ssh server for tunneling when starting the Controller.
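Putting the remote case together, that would look something like the
following (the path and hostname are just the placeholders used above):

In [1]: from IPython.parallel import Client
In [2]: rc = Client('/path/to/ipcontroller-client.json',
   ...:             sshserver='loginnode.example.com')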
-MinRK
>
> will continue digging,
> best.
> Johann
>
> On 07/04/2011 05:07 PM, Johann Cohen-Tanugi wrote:
>> hi there, my problem is that a line seems to be added to the
>> template I am defining following the tutorial:
>> the template proposed in the tutorial is modified at runtime as:
>>
>> #!/bin/sh
>> #PBS -t 1-4    <----------------- incorrect?
>> #PBS -V
>> #PBS -N ipengine
>> /usr/local/bin/python \
>>     /sps/glast/users/cohen/IPYDEV/ipython/IPython/parallel/apps/ipengineapp.py \
>>     profile_dir=/afs/in2p3.fr/home/t/tanugi/.ipython/profile_pbs
>>
>>
>> The problem I believe is in the job_array_template in:
>>
>> class PBSLauncher(BatchSystemLauncher):
>>     """A BatchSystemLauncher subclass for PBS."""
>>
>>     submit_command = List(['qsub'], config=True,
>>         help="The PBS submit command ['qsub']")
>>     delete_command = List(['qdel'], config=True,
>>         help="The PBS delete command ['qsub']")
>>     job_id_regexp = Unicode(r'\d+', config=True,
>>         help="Regular expresion for identifying the job ID [r'\d+']")
>>
>>     batch_file = Unicode(u'')
>>     job_array_regexp = Unicode('#PBS\W+-t\W+[\w\d\-\$]+')
>>     job_array_template = Unicode('#PBS -t 1-{n}')
>>     queue_regexp = Unicode('#PBS\W+-q\W+\$?\w+')
>>     queue_template = Unicode('#PBS -q {queue}')
>>
>>
>> I looked at the PBS docs for versions 10 and 11 and I did not see any '-t'
>> option. When I try to run, I get:
>> [tanugi at ccali28 test_directory]$ ipcluster start profile=pbs n=4
>> [IPClusterStart] Using existing profile dir:
>> u'/afs/in2p3.fr/home/t/tanugi/.ipython/profile_pbs'
>> [IPClusterStart] Starting ipcluster with [daemon=False]
>> [IPClusterStart] Creating pid file:
>> /afs/in2p3.fr/home/t/tanugi/.ipython/profile_pbs/pid/ipcluster.pid
>> [IPClusterStart] Starting PBSControllerLauncher: ['qsub',
>> u'/afs/in2p3.fr/home/t/tanugi/.ipython/profile_pbs/pbs_controller']
>> [IPClusterStart] adding job array settings to batch script
>> [IPClusterStart] Writing instantiated batch script:
>> /afs/in2p3.fr/home/t/tanugi/.ipython/profile_pbs/pbs_controller
>> unknown -t option
>> ERROR:root:Error in periodic callback
>> Traceback (most recent call last):
>> File
>> "/sps/glast/users/cohen/IPYDEV/local/lib/python2.6/site-packages/zmq/eventloop/ioloop.py",
>> line 432, in _run
>> self.callback()
>> File
>> "/sps/glast/users/cohen/IPYDEV/ipython/IPython/parallel/apps/ipclusterapp.py",
>> line 364, in start_controller
>> self.profile_dir.location
>> File
>> "/sps/glast/users/cohen/IPYDEV/ipython/IPython/parallel/apps/launcher.py",
>> line 943, in start
>> return super(PBSControllerLauncher, self).start(1, profile_dir)
>> File
>> "/sps/glast/users/cohen/IPYDEV/ipython/IPython/parallel/apps/launcher.py",
>> line 902, in start
>> job_id = self.parse_job_id(output)
>> File
>> "/sps/glast/users/cohen/IPYDEV/ipython/IPython/parallel/apps/launcher.py",
>> line 854, in parse_job_id
>> raise LauncherError("Job id couldn't be determined: %s" % output)
>> LauncherError: Job id couldn't be determined:
>>
>> Not sure yet about the traceback, but the "unknown -t option" is clear.
>> Furthermore, I wonder: is adding lines to a template file provided by
>> the user really what we want?
>>
>> best,
>> Johann