[IPython-dev] 0.11rc1 : problem with tutorial for PBS in http://ipython.org/ipython-doc/dev/parallel/parallel_process.html

Johann Cohen-Tanugi johann.cohentanugi at gmail.com
Tue Jul 5 01:49:27 EDT 2011


Thanks a lot for all your help, Min.
I will check your fix today. I think that for PBS, -t should be replaced
by -J, but I cannot test it: I realized yesterday that the batch system
I have access to is derived from PBS, but is not PBS, and does not seem
to honor -J. I will switch to SGE, then back to LSF with my updated
launcher.py, once I am sure I understand the SGE code that is already
shipped with IPython and that you were able to test.

best,
Johann


On 07/05/2011 01:56 AM, MinRK wrote:
> On Mon, Jul 4, 2011 at 13:01, Johann Cohen-Tanugi
> <johann.cohentanugi at gmail.com>  wrote:
>> good evening.... still trying to make the PBS batch parallel code work.
>> I had to comment out the "-t" line in launcher.py, but I am still puzzled by
>> the fact that there is no loop over n to start n different engines. Is
>> that because the '-t' was precisely there to create an array of subjobs?
> Sorry about this, I was testing against Linux with Torque, which
> supports job arrays via '-t'.  How to start a collection of jobs is
> going to vary from one PBS to another.
>
> For instance, on some systems the default will be to use *no* job
> array, and run multiple engines via aprun or mpiexec.  I just looked,
> and the addition of the job array line is unconditional, which is
> definitely wrong. I just pushed a fix for that, so if you specify your
> own template, it should not be changed at all by IPython.
>
> A possible example:
>
> #PBS -N ipython
> #PBS -j oe
> #PBS -l walltime=00:10:00
> #PBS -l nodes={n/4}:ppn=4 # assumes 4-CPU nodes
> #PBS -q {queue}
>
> cd $PBS_O_WORKDIR
>
> aprun -n {n} ipengine profile_dir={profile_dir}
>
> This is if your system uses aprun, though mpiexec may be more
> appropriate for you.  Note that if you have PBS but no parallel
> launcher such as mpiexec or aprun, then you may have to do something like a
> simple loop:
>
> for i in {{1..{n}}}; do
>     ipengine profile_dir={profile_dir} &   # background each engine so they all start
> done
> wait   # keep the batch job alive while the engines run
>
> Note the double braces: the templates use string.format, so to get a
> literal brace you have to double it.  That is,
> for i in {{1..{n}}}
> becomes
> for i in {1..4}
>
> which bash brace-expands to 1 2 3 4.
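>
> You can check the escaping quickly at a plain Python prompt (this is
> just str.format, nothing IPython-specific):
>
> In [1]: 'for i in {{1..{n}}}'.format(n=4)
> Out[1]: 'for i in {1..4}'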
>
> The templating should support any of these.
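>
> For the record, the engine template can also live directly in
> ipcluster_config.py; a rough sketch (I'm quoting the trait and class
> names from memory, so double-check them against the commented-out
> examples in your generated config file):
>
> # in <profile_dir>/ipcluster_config.py
> c = get_config()
>
> # minimal engine template; adjust resources for your site
> c.PBSEngineSetLauncher.batch_template = """#!/bin/sh
> #PBS -V
> #PBS -N ipengine
> #PBS -l walltime=00:10:00
> mpiexec -n {n} ipengine profile_dir={profile_dir}
> """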
>
>> Second question, more general: assuming the use of ipcluster, a
>> controller and several engines are created; following the tutorial, all
>> would actually run in batch, which seems strange to me for the
>> controller: batch queues usually have time limits, and it is
>> unavoidable that engines will die when the CPU time is exceeded, but I
>> do not see why the controller should suffer from this. What would be the
>> rationale for running the controller in batch rather than locally? Third
>> question: once the engines run in batch, I presume that they listen to
>> commands sent from any ipython session that I start interactively,
>> provided I use Client() with the correct permissions in terms of
>> ports, ssh, etc.  Is that correct, i.e. is that indeed the idea?
> You can choose to start the controller with batch or not; that's up to
> you.  There is no coupling at all between which launcher you choose
> for the Controller and which you choose for the Engines.  If you want
> the controller to live longer than the batch system will allow, then
> using the batch launcher for the controller is obviously the wrong
> choice.
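>
> Concretely, the two launchers are picked independently in
> ipcluster_config.py.  Something along these lines should keep the
> controller on the login node while the engines go through PBS (again,
> I'm quoting the names from memory, so verify them against the
> commented-out examples in the generated config):
>
> c = get_config()
>
> # controller runs locally, engines are submitted with qsub
> c.IPClusterStart.controller_launcher_class = \
>     'IPython.parallel.apps.launcher.LocalControllerLauncher'
> c.IPClusterEngines.engine_launcher_class = \
>     'IPython.parallel.apps.launcher.PBSEngineSetLauncher'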
>
> It does sometimes make sense to launch the controller with batch,
> because you could have an entire job with controller, engines, and
> clients *all* submitted via the batch system.  I do this sometimes
> with SGE for scaling tests using StarCluster.  It also helps
> load-balancing on shared-node systems, since the controller should be
> given a work slot of its own.  If you have a high-throughput workload, you
> don't want to run n engines + a controller on an n-cpu node, because
> they will be fighting over resources.
>
> Two more reasons to start the controller with batch: it will be
> faster, and you can turn off your local machine.  You can submit a
> million jobs, turn off your local machine altogether, then connect
> again later and retrieve your results.  It will be faster, because the
> majority of communication happens between the controller and the
> engines, rather than between the client and the controller.
>
> -MinRK
>
>> sorry to be dense about all that... I think it would be useful if the
>> batch doc page were supplemented with the final step, which amounts to
>> starting an interactive IPython session and connecting to the batch engines.
> Sure, I can add this, though it's no different from connecting to a
> controller from any other machine.
>
> If you are on the same system (e.g. in a batch script or on a login
> node), it will amount to:
>
> ipython
> In [1]: from IPython.parallel import Client
> In [2]: rc = Client(profile='clusterprofile')
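>
> and then, as a quick sanity check that the engines have registered
> (the output assumes four engines):
>
> In [3]: rc.ids
> Out[3]: [0, 1, 2, 3]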
>
> And if you are not, you will have to get the ipcontroller-client.json
> file from profile_dir/security with scp, and do:
> In [2]: rc = Client('/path/to/ipcontroller-client.json')
>
> possibly adding `sshserver='loginnode.example.com'` if you didn't
> specify the ssh server for tunneling when starting the Controller.
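>
> i.e. something like (path and hostname are placeholders):
>
> In [2]: rc = Client('/path/to/ipcontroller-client.json',
>    ...:             sshserver='loginnode.example.com')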
>
> -MinRK
>
>> will continue digging,
>> best.
>> Johann
>>
>> On 07/04/2011 05:07 PM, Johann Cohen-Tanugi wrote:
>>> hi there, my problem is that a line seems to be added to the
>>> template I am defining following the tutorial: the template proposed
>>> in the tutorial is modified at runtime into:
>>>
>>> #!/bin/sh
>>> #PBS -t 1-4    <----------------- incorrect?
>>> #PBS -V
>>> #PBS -N ipengine
>>> /usr/local/bin/python \
>>> /sps/glast/users/cohen/IPYDEV/ipython/IPython/parallel/apps/ipengineapp.py \
>>> profile_dir=/afs/in2p3.fr/home/t/tanugi/.ipython/profile_pbs
>>>
>>>
>>> The problem, I believe, is in the job_array_template in PBSLauncher:
>>>
>>> class PBSLauncher(BatchSystemLauncher):
>>>        """A BatchSystemLauncher subclass for PBS."""
>>>
>>>        submit_command = List(['qsub'], config=True,
>>>            help="The PBS submit command ['qsub']")
>>>        delete_command = List(['qdel'], config=True,
>>>            help="The PBS delete command ['qsub']")
>>>        job_id_regexp = Unicode(r'\d+', config=True,
>>>            help="Regular expresion for identifying the job ID [r'\d+']")
>>>
>>>        batch_file = Unicode(u'')
>>>        job_array_regexp = Unicode('#PBS\W+-t\W+[\w\d\-\$]+')
>>>        job_array_template = Unicode('#PBS -t 1-{n}')
>>>        queue_regexp = Unicode('#PBS\W+-q\W+\$?\w+')
>>>        queue_template = Unicode('#PBS -q {queue}')
>>>
>>>
>>> I looked at the PBS doc for version 10 and 11 and I did not see any '-t'
>>> option. When I try to run, I get:
>>> [tanugi at ccali28 test_directory]$ ipcluster start profile=pbs n=4
>>> [IPClusterStart] Using existing profile dir:
>>> u'/afs/in2p3.fr/home/t/tanugi/.ipython/profile_pbs'
>>> [IPClusterStart] Starting ipcluster with [daemon=False]
>>> [IPClusterStart] Creating pid file:
>>> /afs/in2p3.fr/home/t/tanugi/.ipython/profile_pbs/pid/ipcluster.pid
>>> [IPClusterStart] Starting PBSControllerLauncher: ['qsub',
>>> u'/afs/in2p3.fr/home/t/tanugi/.ipython/profile_pbs/pbs_controller']
>>> [IPClusterStart] adding job array settings to batch script
>>> [IPClusterStart] Writing instantiated batch script:
>>> /afs/in2p3.fr/home/t/tanugi/.ipython/profile_pbs/pbs_controller
>>> unknown -t option
>>> ERROR:root:Error in periodic callback
>>> Traceback (most recent call last):
>>>      File
>>> "/sps/glast/users/cohen/IPYDEV/local/lib/python2.6/site-packages/zmq/eventloop/ioloop.py",
>>> line 432, in _run
>>>        self.callback()
>>>      File
>>> "/sps/glast/users/cohen/IPYDEV/ipython/IPython/parallel/apps/ipclusterapp.py",
>>> line 364, in start_controller
>>>        self.profile_dir.location
>>>      File
>>> "/sps/glast/users/cohen/IPYDEV/ipython/IPython/parallel/apps/launcher.py",
>>> line 943, in start
>>>        return super(PBSControllerLauncher, self).start(1, profile_dir)
>>>      File
>>> "/sps/glast/users/cohen/IPYDEV/ipython/IPython/parallel/apps/launcher.py",
>>> line 902, in start
>>>        job_id = self.parse_job_id(output)
>>>      File
>>> "/sps/glast/users/cohen/IPYDEV/ipython/IPython/parallel/apps/launcher.py",
>>> line 854, in parse_job_id
>>>        raise LauncherError("Job id couldn't be determined: %s" % output)
>>> LauncherError: Job id couldn't be determined:
>>>
>>> Not sure yet about the traceback, but the "unknown -t option" is clear.
>>> Furthermore, I wonder whether adding lines to a template file provided
>>> by the user is really what we want.
>>>
>>> best,
>>> Johann


