[IPython-dev] 0.11rc1 : problem with tutorial for PBS in http://ipython.org/ipython-doc/dev/parallel/parallel_process.html

Johann Cohen-Tanugi johann.cohentanugi at gmail.com
Wed Jul 6 08:33:36 EDT 2011


Hi Min, your fix worked, thanks. I managed to connect to engines running
under SGE from within an IPython session, and so I started to clone that
code for LSF. I ran into the following issue: with the LSF farm I have at
hand, I am forced to use "bsub < batch_script" rather than "bsub
batch_script". I asked the admins about this and hope to have an answer
tonight. Note that the same issue seems to have been encountered in
https://github.com/ipython/ipython/blob/0.10.2/IPython/kernel/scripts/ipcluster.py
where I can see:

bsub_wrapper="""#!/bin/sh
bsub < $1

So I tried:
class LSFLauncher(BatchSystemLauncher):
     """A BatchSystemLauncher subclass for LSF."""

     submit_command = List(['bsub >'], config=True,<....>

but this results in a crash:
-bash-3.2$ ipcluster start n=4 profile=lsf
[IPClusterStart] Using existing profile dir: 
u'/u/ec/cohen/.config/ipython/profile_lsf'
[IPClusterStart] Starting ipcluster with [daemon=False]
[IPClusterStart] Creating pid file: 
/u/ec/cohen/.config/ipython/profile_lsf/pid/ipcluster.pid
[IPClusterStart] Starting LocalControllerLauncher: 
['/afs/slac/g/glast/ground/GLAST_EXT/redhat5-x86_64-64bit-gcc41/python/2.6.5/gcc41/bin/python', 
u'/a/wain006/g.glast.u54/cohen/IPYDEV/ipython/IPython/parallel/apps/ipcontrollerapp.py', 
'--log-to-file', 'log_level=20', 
u'profile_dir=/u/ec/cohen/.config/ipython/profile_lsf']
[IPClusterStart] Process 
'/afs/slac/g/glast/ground/GLAST_EXT/redhat5-x86_64-64bit-gcc41/python/2.6.5/gcc41/bin/python' 
started: 16921
[IPClusterStart] Starting 4 engines
[IPClusterStart] Starting 4 engines with LSFEngineSetLauncher: ['bsub 
\\<', u'/u/ec/cohen/.config/ipython/profile_lsf/lsf_engines']
[IPClusterStart] adding job array settings to batch script
[IPClusterStart] Writing instantiated batch script: 
/u/ec/cohen/.config/ipython/profile_lsf/lsf_engines
ERROR:root:Error in periodic callback
Traceback (most recent call last):
   File 
"/afs/slac/g/glast/users/cohen/IPYDEV/local/lib/python2.6/site-packages/zmq/eventloop/ioloop.py", 
line 432, in _run
     self.callback()
   File 
"/a/wain006/g.glast.u54/cohen/IPYDEV/ipython/IPython/parallel/apps/ipclusterapp.py", 
line 258, in start_engines
     self.profile_dir.location
   File 
"/a/wain006/g.glast.u54/cohen/IPYDEV/ipython/IPython/parallel/apps/launcher.py", 
line 1049, in start
     return super(LSFEngineSetLauncher, self).start(n, profile_dir)
   File 
"/a/wain006/g.glast.u54/cohen/IPYDEV/ipython/IPython/parallel/apps/launcher.py", 
line 902, in start
     output = check_output(self.args, env=os.environ)
   File 
"/a/wain006/g.glast.u54/cohen/IPYDEV/ipython/IPython/parallel/apps/launcher.py", 
line 51, in check_output
     p = Popen(*args, **kwargs)
   File 
"/afs/slac/g/glast/ground/GLAST_EXT/redhat5-x86_64-64bit-gcc41/python/2.6.5/gcc41/lib/python2.6/subprocess.py", 
line 633, in __init__
     errread, errwrite)
   File 
"/afs/slac/g/glast/ground/GLAST_EXT/redhat5-x86_64-64bit-gcc41/python/2.6.5/gcc41/lib/python2.6/subprocess.py", 
line 1139, in _execute_child
     raise child_exception
OSError: [Errno 2] No such file or directory
[IPClusterStart] [IPControllerApp] Using existing profile dir: 
u'/u/ec/cohen/.config/ipython/profile_lsf'
ERROR:IPClusterStart:[IPControllerApp] Using existing profile dir: 
u'/u/ec/cohen/.config/ipython/profile_lsf'
[IPClusterStart] Scheduler started [leastload]
ERROR:IPClusterStart:Scheduler started [leastload]

and the engines are never started in the batch system.
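
One possible reading of the traceback: the launcher passes its argument
list straight to Popen without a shell, so the first element ('bsub \<'
in the log above) is looked up literally as a program name, and that
lookup fails with "No such file or directory". The "<" redirection is a
shell feature; without a shell, the same effect can be had by opening the
script and handing it to the child as stdin. The sketch below is only an
illustration under that assumption. The helper name submit_via_stdin is
hypothetical, and a small Python child process stands in for bsub so that
the example runs anywhere.

```python
import subprocess
import sys
import tempfile


def submit_via_stdin(submit_command, script_path):
    """Run submit_command with the script on stdin, like `cmd < script`."""
    with open(script_path, "rb") as script:
        proc = subprocess.Popen(submit_command, stdin=script,
                                stdout=subprocess.PIPE)
        output, _ = proc.communicate()
    return output


# Write a tiny stand-in batch script (contents are illustrative only).
with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as handle:
    handle.write("#BSUB -J demo\n")
    script_path = handle.name

# Without a shell, the first list element is treated as the program name,
# so "bsub <" is searched for literally and the lookup fails.
try:
    subprocess.Popen(["bsub <", script_path])
    literal_lookup_failed = False
except OSError:
    literal_lookup_failed = True

# Hypothetical stand-in for bsub: a child process that echoes its stdin.
echo_stdin = [sys.executable, "-c",
              "import sys; sys.stdout.write(sys.stdin.read())"]
echoed = submit_via_stdin(echo_stdin, script_path)
```

With a real LSF installation, submit_command would simply be ['bsub'],
and the launcher would need a way to request the stdin variant.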

thoughts?
Johann


On 07/05/2011 07:49 AM, Johann Cohen-Tanugi wrote:
> Thanks a lot for all your help, Min.
> I will check your fix today. I think that for PBS, -t should be replaced
> by -J, but I cannot test this: I realized yesterday that the batch
> system I have access to is derived from PBS but is not PBS itself, and
> it does not seem to honor -J. I will switch to SGE, then go back to LSF
> with my updated launcher.py once I am sure I understand the SGE code
> that already ships with IPython and that you were able to test.
>
> best,
> Johann
>
>
> On 07/05/2011 01:56 AM, MinRK wrote:
>> On Mon, Jul 4, 2011 at 13:01, Johann Cohen-Tanugi
>> <johann.cohentanugi at gmail.com>   wrote:
>>> Good evening. I am still trying to make the PBS batch parallel code work.
>>> I had to comment out the "-t" line in launcher.py, but I am still puzzled
>>> by the fact that there is no loop over n to start n different engines. Is
>>> that because the '-t' was there precisely to create an array of subjobs?
>> Sorry about this, I was testing against Linux with Torque, which
>> supports job arrays via '-t'.  How to start a collection of jobs is
>> going to vary from one PBS to another.
>>
>> For instance, on some systems the default will be to use *no* job
>> array, and run multiple engines via aprun or mpiexec.  I just looked,
>> and the addition of the jobarray line is unconditional, which is
>> definitely wrong. I just pushed a fix for that, so if you specify your
>> own template, it should not be changed at all by IPython.
>>
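
The behaviour described above (leave a user-specified template alone, add
the job array line only otherwise) could look roughly like the sketch
below. This is not IPython's actual code: the function instantiate and
its user_supplied flag are hypothetical names chosen for illustration,
and the two sample templates are made up. Only the regexp and the job
array template string are taken from the PBSLauncher class quoted further
down.

```python
import re

# Taken from the PBSLauncher attributes quoted later in this thread.
JOB_ARRAY_REGEXP = r"#PBS\W+-t\W+[\w\d\-\$]+"
JOB_ARRAY_TEMPLATE = "#PBS -t 1-{n}"


def instantiate(template, user_supplied, **context):
    """Fill in a batch template, adding a job array line only by default.

    Hypothetical helper: a user-supplied template is formatted as-is,
    while a default template gets the job array line after its first line
    unless it already contains one.
    """
    if not user_supplied and not re.search(JOB_ARRAY_REGEXP, template):
        shebang, _, rest = template.partition("\n")
        template = "\n".join([shebang, JOB_ARRAY_TEMPLATE, rest])
    return template.format(**context)


# Illustrative templates, not the ones shipped with IPython.
default = "#!/bin/sh\n#PBS -V\nipengine profile_dir={profile_dir}\n"
custom = "#PBS -N ipython\naprun -n {n} ipengine profile_dir={profile_dir}\n"

with_array = instantiate(default, False, n=4, profile_dir="/tmp/p")
untouched = instantiate(custom, True, n=4, profile_dir="/tmp/p")
```

Here with_array gains a "#PBS -t 1-4" line directly after the shebang,
while untouched only has its {n} and {profile_dir} fields filled in.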
>> A possible example:
>>
>> #PBS -N ipython
>> #PBS -j oe
>> #PBS -l walltime=00:10:00
>> #PBS -l nodes={n/4}:ppn=4 # assumes 4-CPU nodes
>> #PBS -q {queue}
>>
>> cd $PBS_O_WORKDIR
>>
>> aprun -n {n} ipengine profile_dir={profile_dir}
>>
>> This is if your system uses aprun, though mpiexec may be more
>> appropriate for you.  Note that if you have PBS but no parallel
>> environments like mpi or ap, then you may have to do something like a
>> simple loop:
>>
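>> for i in {{1..{n}}}; do
>>       ipengine profile_dir={profile_dir} &
>> done
>> wait
>>
>> # Note the double braces.  The templates use string.format, so literal
>> braces must be doubled to escape them.  With n=4:
>> for i in {{1..{n}}}
>> becomes
>> for i in {1..4}
>>
>> which in bash expands to 1 2 3 4 (the braces must not be padded with
>> spaces, or bash will not expand them).  The trailing & puts each
>> engine in the background so the loop does not block on the first one.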
>>
>> The templating should support any of these.
>>
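
The brace escaping described above can be checked with plain str.format.
This is a minimal sketch: the one-line template is a made-up example, not
one of IPython's templates, and the range is written without padding
spaces because bash only performs the sequence expansion on the form
{1..4}.

```python
# Doubled braces survive formatting as single literal braces, while
# {n} is replaced by the value passed to format().
template = "for i in {{1..{n}}}; do echo engine $i; done\n"

rendered = template.format(n=4)
first_line = rendered.splitlines()[0]
```

After formatting, first_line is a loop header that bash can expand
to the four values 1 2 3 4.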
>>> Second question, more general: assuming the use of ipcluster, a
>>> controller and several engines are created. Following the tutorial, all
>>> of them would run in batch, which seems strange to me for the
>>> controller: batch queues usually have time limits, and it is
>>> unavoidable that engines die when the CPU time is exceeded, but I
>>> do not see why the controller should suffer from this. What would be the
>>> rationale for running the controller in batch rather than locally? Third
>>> question: once the engines run in batch, I presume that they listen for
>>> commands sent from any IPython session that I start interactively,
>>> provided I use Client() with the correct permissions in terms of
>>> ports, ssh, etc. Is that correct, i.e. is that indeed the idea?
>> You can choose to start the controller with batch or not, that's up to
>> you.  There is no coupling at all between which launcher you choose
>> for the Controller and which you choose for the Engines.  If you want
>> the controller to live longer than the batch system will allow, then
>> using the batch launcher for the controller is obviously the wrong
>> choice.
>>
>> It does sometimes make sense to launch the controller with batch,
>> because you could have an entire job with controller, engines, and
>> clients *all* submitted via the batch system.  I do this sometimes
>> with SGE for scaling tests using starcluster. It also helps
>> load-balancing for shared-node systems, since the controller should
>> take up a work slot.  If you have a high throughput workload, you
>> don't want to run n engines + a controller on an n-cpu node, because
>> they will be fighting over resources.
>>
>> Two more reasons to start the controller with batch: it will be
>> faster, and you can turn off your local machine.  You can submit a
>> million jobs, turn off your local machine altogether, then connect
>> again later and retrieve your results.  It will be faster, because the
>> majority of communication happens between the controller and the
>> engines, rather than the client and the controller.
>>
>> -MinRK
>>
>>> sorry to be dense about all that... I think it would be useful if the
>>> batch doc page was supplemented with the final step which amounts to
>>> starting an interactive ipython session and connecting to the batch engines.
>> Sure, I can add this, though it's not different from any method of
>> connecting to a controller from another machine.
>>
>> If you are on the same system (e.g. in batch script or on a login
>> node), it will amount to:
>>
>> ipython
>> In [1]: from IPython.parallel import Client
>> In [2]: rc = Client(profile='clusterprofile')
>>
>> And if you are not, you will have to get the ipcontroller-client.json
>> file from profile_dir/security with scp, and do:
>> In [2]: rc = Client('/path/to/ipcontroller-client.json' )
>>
>> possibly adding `sshserver='loginnode.example.com'` if you didn't
>> specify the ssh server for tunneling when starting the Controller.
>>
>> -MinRK
>>
>>> will continue digging,
>>> best.
>>> Johann
>>>
>>> On 07/04/2011 05:07 PM, Johann Cohen-Tanugi wrote:
>>>> Hi there, my problem is that a line seems to be added to the
>>>> template I am defining following the tutorial. The template proposed
>>>> in the tutorial is modified at runtime to:
>>>>
>>>> #!/bin/sh
>>>> #PBS -t 1-4<----------------- incorrect?
>>>> #PBS -V
>>>> #PBS -N ipengine
>>>> /usr/local/bin/python
>>>> /sps/glast/users/cohen/IPYDEV/ipython/IPython/parallel/apps/ipengineapp.py
>>>> profile_dir=/afs/in2p3.fr/home/t/tanugi/\
>>>> .ipython/profile_pbs
>>>>
>>>>
>>>> The problem I believe is in the job_array_template in  :
>>>>
>>>> class PBSLauncher(BatchSystemLauncher):
>>>>         """A BatchSystemLauncher subclass for PBS."""
>>>>
>>>>         submit_command = List(['qsub'], config=True,
>>>>             help="The PBS submit command ['qsub']")
>>>>         delete_command = List(['qdel'], config=True,
>>>>             help="The PBS delete command ['qdel']")
>>>>         job_id_regexp = Unicode(r'\d+', config=True,
>>>>             help="Regular expression for identifying the job ID [r'\d+']")
>>>>
>>>>         batch_file = Unicode(u'')
>>>>         job_array_regexp = Unicode('#PBS\W+-t\W+[\w\d\-\$]+')
>>>>         job_array_template = Unicode('#PBS -t 1-{n}')
>>>>         queue_regexp = Unicode('#PBS\W+-q\W+\$?\w+')
>>>>         queue_template = Unicode('#PBS -q {queue}')
>>>>
>>>>
>>>> I looked at the PBS doc for version 10 and 11 and I did not see any '-t'
>>>> option. When I try to run, I get :
>>>> [tanugi at ccali28 test_directory]$ ipcluster start profile=pbs n=4
>>>> [IPClusterStart] Using existing profile dir:
>>>> u'/afs/in2p3.fr/home/t/tanugi/.ipython/profile_pbs'
>>>> [IPClusterStart] Starting ipcluster with [daemon=False]
>>>> [IPClusterStart] Creating pid file:
>>>> /afs/in2p3.fr/home/t/tanugi/.ipython/profile_pbs/pid/ipcluster.pid
>>>> [IPClusterStart] Starting PBSControllerLauncher: ['qsub',
>>>> u'/afs/in2p3.fr/home/t/tanugi/.ipython/profile_pbs/pbs_controller']
>>>> [IPClusterStart] adding job array settings to batch script
>>>> [IPClusterStart] Writing instantiated batch script:
>>>> /afs/in2p3.fr/home/t/tanugi/.ipython/profile_pbs/pbs_controller
>>>> unknown -t option
>>>> ERROR:root:Error in periodic callback
>>>> Traceback (most recent call last):
>>>>       File
>>>> "/sps/glast/users/cohen/IPYDEV/local/lib/python2.6/site-packages/zmq/eventloop/ioloop.py",
>>>> line 432, in _run
>>>>         self.callback()
>>>>       File
>>>> "/sps/glast/users/cohen/IPYDEV/ipython/IPython/parallel/apps/ipclusterapp.py",
>>>> line 364, in start_controller
>>>>         self.profile_dir.location
>>>>       File
>>>> "/sps/glast/users/cohen/IPYDEV/ipython/IPython/parallel/apps/launcher.py",
>>>> line 943, in start
>>>>         return super(PBSControllerLauncher, self).start(1, profile_dir)
>>>>       File
>>>> "/sps/glast/users/cohen/IPYDEV/ipython/IPython/parallel/apps/launcher.py",
>>>> line 902, in start
>>>>         job_id = self.parse_job_id(output)
>>>>       File
>>>> "/sps/glast/users/cohen/IPYDEV/ipython/IPython/parallel/apps/launcher.py",
>>>> line 854, in parse_job_id
>>>>         raise LauncherError("Job id couldn't be determined: %s" % output)
>>>> LauncherError: Job id couldn't be determined:
>>>>
>>>> I am not sure yet about the traceback, but the "unknown -t option" is
>>>> clear. Furthermore, I wonder whether we really want to add lines to a
>>>> template file provided by the user.
>>>>
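
A plausible explanation for the traceback above: the submit command
rejected the script, so its output contained no job ID, and the job ID
regexp (r'\d+' in the class quoted above) had nothing to match. The
sketch below imitates that logic under this assumption. It is not the
code from launcher.py: LauncherError is redefined locally as a stand-in,
and the sample submit output "1234.server.example.com" is invented.

```python
import re


class LauncherError(Exception):
    """Local stand-in for the exception named in the traceback."""


# Same pattern as job_id_regexp in the quoted PBSLauncher class.
JOB_ID_REGEXP = r"\d+"


def parse_job_id(output):
    """Return the first run of digits in the submit command's output."""
    match = re.search(JOB_ID_REGEXP, output)
    if match is None:
        raise LauncherError("Job id couldn't be determined: %s" % output)
    return match.group()


# A successful submission typically prints something containing an ID.
job_id = parse_job_id("1234.server.example.com")

# An empty output, as after a rejected submission, raises the error.
try:
    parse_job_id("")
    empty_output_rejected = False
except LauncherError:
    empty_output_rejected = True
```

Under this reading, the LauncherError is a consequence of the rejected
"-t" option rather than a separate problem.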
>>>> best,
>>>> Johann
>>>> _______________________________________________
>>>> IPython-dev mailing list
>>>> IPython-dev at scipy.org
>>>> http://mail.scipy.org/mailman/listinfo/ipython-dev
>>>>