[IPython-dev] 0.11rc1 : problem with tutorial for PBS in http://ipython.org/ipython-doc/dev/parallel/parallel_process.html
Johann Cohen-Tanugi
johann.cohentanugi at gmail.com
Wed Jul 6 08:33:36 EDT 2011
Hi Min, your fix worked, thanks. I managed to connect to engines running
under SGE from within an ipython session, and so started to clone the code
for LSF. But I have the following issue: with the LSF farm I have at
hand, I am forced to use "bsub < batch_script" rather than "bsub
batch_script". I asked the admins about this and hope to have an answer
tonight. Note that in
, the same issue seems to have been encountered as I can see :
bsub < $1
So I tried
class LSFLauncher(BatchSystemLauncher):
    """A BatchSystemLauncher subclass for LSF."""
    submit_command = List(['bsub >'], config=True,<....>
but this results in a crash :
-bash-3.2$ ipcluster start n=4 profile=lsf
[IPClusterStart] Using existing profile dir:
[IPClusterStart] Starting ipcluster with [daemon=False]
[IPClusterStart] Creating pid file:
[IPClusterStart] Starting LocalControllerLauncher:
'--log-to-file', 'log_level=20',
[IPClusterStart] Process
started: 16921
[IPClusterStart] Starting 4 engines
[IPClusterStart] Starting 4 engines with LSFEngineSetLauncher: ['bsub
\\<', u'/u/ec/cohen/.config/ipython/profile_lsf/lsf_engines']
[IPClusterStart] adding job array settings to batch script
[IPClusterStart] Writing instantiated batch script:
ERROR:root:Error in periodic callback
Traceback (most recent call last):
line 432, in _run
line 258, in start_engines
line 1049, in start
return super(LSFEngineSetLauncher, self).start(n, profile_dir)
line 902, in start
output = check_output(self.args, env=os.environ)
line 51, in check_output
p = Popen(*args, **kwargs)
line 633, in __init__
errread, errwrite)
line 1139, in _execute_child
raise child_exception
OSError: [Errno 2] No such file or directory
[IPClusterStart] [IPControllerApp] Using existing profile dir:
ERROR:IPClusterStart:[IPControllerApp] Using existing profile dir:
[IPClusterStart] Scheduler started [leastload]
ERROR:IPClusterStart:Scheduler started [leastload]
and the engines are not started in the batch....
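A likely cause, offered as a sketch rather than a confirmed diagnosis: the traceback shows the argument list going straight to Popen without a shell, so 'bsub <' is looked up as a program name and no redirection ever happens, which would explain the "No such file or directory" error. One way around it is to open the batch script and pass it as the child's stdin. The helper names below are hypothetical and not part of IPython's launcher.py; any command that reads stdin can stand in for bsub when trying it out.

```python
import subprocess


def submit_via_stdin(submit_command, script_path):
    """Run submit_command with script_path connected to its stdin.

    This has the same effect as the shell line
    `submit_command < script_path`, but no shell is involved.
    """
    with open(script_path, "rb") as script:
        return subprocess.check_output(list(submit_command), stdin=script)


def submit_as_argument(submit_command, script_path):
    """Run `submit_command script_path`, the form that works for qsub."""
    return subprocess.check_output(list(submit_command) + [script_path])
```

With this split, an LSF launcher could keep submit_command as ['bsub'] and only change how the script is handed over, instead of trying to embed the redirection in the command list.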
On 07/05/2011 07:49 AM, Johann Cohen-Tanugi wrote:
> Thanks a lot for all your help Min.
> I will check your fix today. I think that for PBS -t should be replaced
> by -J, but I cannot test it as I realized yesterday that the batch
> system I put my hands on is derived from PBS, but is not PBS, and does
> not seem to honor -J. I will switch to SGE, then back to LSF with my
> updated launcher.py once I am sure I understand the SGE code that is
> already shipped with ipython and that you were able to test.
> best,
> Johann
> On 07/05/2011 01:56 AM, MinRK wrote:
>> On Mon, Jul 4, 2011 at 13:01, Johann Cohen-Tanugi
>> <johann.cohentanugi at gmail.com> wrote:
>>> good evening.... still trying to make the PBS batch parallel code work.
>>> I had to comment out the "-t" line in launcher.py, but I am still puzzled by
>>> the fact that there is no loop over n to start n different engines. Is
>>> that because the '-t' was precisely there to create an array of subjobs?
>> Sorry about this, I was testing against Linux with Torque, which
>> supports job arrays via '-t'. How to start a collection of jobs is
>> going to vary from one PBS to another.
>> For instance, on some systems the default will be to use *no* job
>> array, and run multiple engines via aprun or mpiexec. I just looked,
>> and the addition of the jobarray line is unconditional, which is
>> definitely wrong. I just pushed a fix for that, so if you specify your
>> own template, it should not be changed at all by IPython.
>> A possible example:
>> #PBS -N ipython
>> #PBS -j oe
>> #PBS -l walltime=00:10:00
>> #PBS -l nodes={n/4}:ppn=4 # assumes 4-CPU nodes
>> #PBS -q {queue}
>> aprun -n {n} ipengine profile_dir={profile_dir}
>> This is if your system uses aprun, though mpiexec may be more
>> appropriate for you. Note that if you have PBS but no parallel
>> environments like mpi or ap, then you may have to do something like a
>> simple loop:
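>> for i in {{1..{n}}}; do
>>     ipengine profile_dir={profile_dir} &
>> done
>> wait
>> # note the double braces. The templates use string.format, so to
>> escape braces you need to double them, so:
>> for i in {{1..{n}}}
>> becomes
>> for i in {1..4}
>> which in bash expands to 1 2 3 4 (no spaces inside the braces, or
>> bash will not expand them). The & and the final wait keep all n
>> engines running at once instead of one after another.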
>> The templating should support any of these.
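A minimal sketch of the brace doubling described above, assuming the template is rendered with plain str.format as stated. The helper name render_batch_template and the profile path are hypothetical, not IPython API. The loop is written without spaces inside the braces because bash only performs brace expansion on the form {1..4}.

```python
# Hypothetical helper, not part of IPython: renders a batch template
# the way plain str.format would.
TEMPLATE = """#!/bin/bash
#PBS -N ipengine
for i in {{1..{n}}}; do
    ipengine profile_dir={profile_dir} &
done
wait
"""


def render_batch_template(template, n, profile_dir):
    """Fill in {n} and {profile_dir}; doubled braces become literal braces."""
    return template.format(n=n, profile_dir=profile_dir)


if __name__ == "__main__":
    # Print the rendered script to check it before submitting it.
    print(render_batch_template(TEMPLATE, 4, "/path/to/profile_pbs"))
```

Rendering the template by hand like this is a cheap way to inspect the script the batch system will actually receive.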
>>> Second question, more general: assuming the use of ipcluster, a
>>> controller and several engines are created; following the tutorial, all
>>> would actually run in batch, which seems strange to me for the
>>> controller: batch queues usually have time limits, and it is
>>> unavoidable that engines would die when the cpu time is exceeded, but I
>>> do not see why the controller should suffer from this. What would be the
>>> rationale for executing the controller in batch rather than locally? Third
>>> question: once the engines run in batch, I presume that they listen to
>>> commands sent from any ipython session that I would interactively start,
>>> provided I use the Client() with the correct permissions in terms of
>>> ports, ssh, etc. Is that correct, i.e. is that indeed the idea?
>> You can choose to start the controller with batch or not, that's up to
>> you. There is no coupling at all between which launcher you choose
>> for the Controller and which you choose for the Engines. If you want
>> the controller to live longer than the batch system will allow, then
>> using the batch launcher for the controller is obviously the wrong
>> choice.
>> It does sometimes make sense to launch the controller with batch,
>> because you could have an entire job with controller, engines, and
>> clients *all* submitted via the batch system. I do this sometimes
>> with SGE for scaling tests using starcluster. It also helps
>> load-balancing for shared-node systems, since the controller should
>> take up a work slot. If you have a high throughput workload, you
>> don't want to run n engines + a controller on an n-cpu node, because
>> they will be fighting over resources.
>> Two more reasons to start the controller with batch: it will be
>> faster, and you can turn off your local machine. You can submit a
>> million jobs, turn off your local machine altogether, then connect
>> again later and retrieve your results. It will be faster, because the
>> majority of communication happens between the controller and the
>> engines, rather than the client and the controller.
>> -MinRK
>>> sorry to be dense about all that... I think it would be useful if the
>>> batch doc page was supplemented with the final step which amounts to
>>> starting an interactive ipython session and connecting to the batch engines.
>> Sure, I can add this, though it's not different from any method of
>> connecting to a controller from another machine.
>> If you are on the same system (e.g. in batch script or on a login
>> node), it will amount to:
>> ipython
>> In [1]: from IPython.parallel import Client
>> In [2]: rc = Client(profile='clusterprofile')
>> And if you are not, you will have to get the ipcontroller-client.json
>> file from profile_dir/security with scp, and do:
>> In [2]: rc = Client('/path/to/ipcontroller-client.json' )
>> possibly adding `sshserver='loginnode.example.com'` if you didn't
>> specify the ssh server for tunneling when starting the Controller.
>> -MinRK
>>> will continue digging,
>>> best.
>>> Johann
>>> On 07/04/2011 05:07 PM, Johann Cohen-Tanugi wrote:
>>>> hi there, my problem is in the fact that a line seems to be added to the
>>>> template I am defining following the tutorial :
>>>> the template proposed in the tutorial is modified at runtime as :
>>>> #!/bin/sh
>>>> #PBS -t 1-4<----------------- incorrect?
>>>> #PBS -V
>>>> #PBS -N ipengine
>>>> /usr/local/bin/python
>>>> /sps/glast/users/cohen/IPYDEV/ipython/IPython/parallel/apps/ipengineapp.py
>>>> profile_dir=/afs/in2p3.fr/home/t/tanugi/\
>>>> .ipython/profile_pbs
>>>> The problem I believe is in the job_array_template in :
>>>> class PBSLauncher(BatchSystemLauncher):
>>>>     """A BatchSystemLauncher subclass for PBS."""
>>>>     submit_command = List(['qsub'], config=True,
>>>>         help="The PBS submit command ['qsub']")
>>>>     delete_command = List(['qdel'], config=True,
>>>>         help="The PBS delete command ['qsub']")
>>>>     job_id_regexp = Unicode(r'\d+', config=True,
>>>>         help="Regular expresion for identifying the job ID [r'\d+']")
>>>>     batch_file = Unicode(u'')
>>>>     job_array_regexp = Unicode('#PBS\W+-t\W+[\w\d\-\$]+')
>>>>     job_array_template = Unicode('#PBS -t 1-{n}')
>>>>     queue_regexp = Unicode('#PBS\W+-q\W+\$?\w+')
>>>>     queue_template = Unicode('#PBS -q {queue}')
>>>> I looked at the PBS doc for version 10 and 11 and I did not see any '-t'
>>>> option. When I try to run, I get :
>>>> [tanugi at ccali28 test_directory]$ ipcluster start profile=pbs n=4
>>>> [IPClusterStart] Using existing profile dir:
>>>> u'/afs/in2p3.fr/home/t/tanugi/.ipython/profile_pbs'
>>>> [IPClusterStart] Starting ipcluster with [daemon=False]
>>>> [IPClusterStart] Creating pid file:
>>>> /afs/in2p3.fr/home/t/tanugi/.ipython/profile_pbs/pid/ipcluster.pid
>>>> [IPClusterStart] Starting PBSControllerLauncher: ['qsub',
>>>> u'/afs/in2p3.fr/home/t/tanugi/.ipython/profile_pbs/pbs_controller']
>>>> [IPClusterStart] adding job array settings to batch script
>>>> [IPClusterStart] Writing instantiated batch script:
>>>> /afs/in2p3.fr/home/t/tanugi/.ipython/profile_pbs/pbs_controller
>>>> unknown -t option
>>>> ERROR:root:Error in periodic callback
>>>> Traceback (most recent call last):
>>>> File
>>>> "/sps/glast/users/cohen/IPYDEV/local/lib/python2.6/site-packages/zmq/eventloop/ioloop.py",
>>>> line 432, in _run
>>>> self.callback()
>>>> File
>>>> "/sps/glast/users/cohen/IPYDEV/ipython/IPython/parallel/apps/ipclusterapp.py",
>>>> line 364, in start_controller
>>>> self.profile_dir.location
>>>> File
>>>> "/sps/glast/users/cohen/IPYDEV/ipython/IPython/parallel/apps/launcher.py",
>>>> line 943, in start
>>>> return super(PBSControllerLauncher, self).start(1, profile_dir)
>>>> File
>>>> "/sps/glast/users/cohen/IPYDEV/ipython/IPython/parallel/apps/launcher.py",
>>>> line 902, in start
>>>> job_id = self.parse_job_id(output)
>>>> File
>>>> "/sps/glast/users/cohen/IPYDEV/ipython/IPython/parallel/apps/launcher.py",
>>>> line 854, in parse_job_id
>>>> raise LauncherError("Job id couldn't be determined: %s" % output)
>>>> LauncherError: Job id couldn't be determined:
>>>> Not sure yet about the traceback, but the "unknown -t option" is clear.
>>>> Furthermore, I wonder whether we really want to add lines to a
>>>> template file provided by the user?
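The behaviour asked for here can be sketched as follows. This is an illustration under assumptions, not the actual launcher.py code: the regexp and the directive are copied from the PBSLauncher class quoted above, while the function name add_job_array and the user_supplied flag are hypothetical.

```python
import re

# Copied from the PBSLauncher attributes quoted in this thread.
JOB_ARRAY_REGEXP = re.compile(r'#PBS\W+-t\W+[\w\d\-\$]+')
JOB_ARRAY_TEMPLATE = '#PBS -t 1-{n}'


def add_job_array(script, user_supplied):
    """Add the job-array directive only to a default template lacking one.

    A template written by the user, or one that already has a '-t'
    directive, is returned unchanged.
    """
    if user_supplied or JOB_ARRAY_REGEXP.search(script):
        return script
    first, _, rest = script.partition('\n')
    if first.startswith('#!'):
        # Keep the shebang on the first line of the script.
        return first + '\n' + JOB_ARRAY_TEMPLATE + '\n' + rest
    return JOB_ARRAY_TEMPLATE + '\n' + script
```

The returned script still contains the {n} placeholder, so it would go through the usual template formatting afterwards.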
>>>> best,
>>>> Johann
>>>> _______________________________________________
>>>> IPython-dev mailing list
>>>> IPython-dev at scipy.org
>>>> http://mail.scipy.org/mailman/listinfo/ipython-dev