hi brian,

thanks for the responses. i'll touch on a few of them.
<div class="gmail_quote"><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;"><div class="im">> * optionally offload the dag directly to the underlying scheduler if it has<br>
> dependency support (i.e., SGE, Torque/PBS, LSF)<br>
<br>
</div>While we could support this, I actually think it would be a step<br>
backwards. <br></blockquote><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">... <br></blockquote><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">
All of this means lots and lots of latency for each task in the DAG.<br>
For tasks that have lots of data or lots of Python modules to import,<br>
that will simply kill the parallel speedup you will get (ala Amdahl's<br>
law).<br></blockquote><div><br>here is the scenario where this becomes a useful thing (and hence to optionally have it). let's say under sge usage you have started 10 clients/ipengines. now at the time of creating the clients one machine with 10 allocations was free and sge routed all the 10 clients to that machine. now this will be the machine that will be used for all ipcluster processing. whereas if the node distribution and ipengine startup were to happen simultaneously at the level of the sge scheduler, processes would get routed to the best available slot at the time of execution. <br>
<br>i agree that in several other scenarios, the current mechanism works great. but this is a common scenario that we have run into in a heavily used cluster (limited nodes + lots of users).<br> </div><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">
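for concreteness, here is a rough sketch of what offloading could look like
with sge: each node of the dag becomes its own qsub submission, and the
edges are expressed with sge's -hold_jid flag, so a job is only released
once the jobs it depends on have finished. the dag and the per-node scripts
here are hypothetical, just to illustrate the idea:

    import subprocess

    # hypothetical dag: node -> list of nodes it depends on
    dag = {'A': [], 'B': ['A'], 'C': ['A'], 'D': ['C'], 'E': ['B']}

    job_ids = {}
    for node in ('A', 'B', 'C', 'D', 'E'):  # any topological order works
        cmd = ['qsub', '-terse']  # -terse: print just the job id
        deps = [job_ids[d] for d in dag[node]]
        if deps:
            # -hold_jid: sge holds this job until the listed jobs finish
            cmd += ['-hold_jid', ','.join(deps)]
        cmd.append('run_%s.sh' % node)  # hypothetical per-node script
        job_ids[node] = subprocess.check_output(cmd).decode().strip()

this way each job is placed by sge at the moment it actually becomes
runnable, rather than at the moment the engines were started.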
<div class="im">> * something we currently do in nipype is that we provide a configurable<br>
> option to continue processing if a given node fails. we simply remove the<br>
> dependencies of the node from further execution and generate a report at the<br>
> end saying which nodes crashed.<br>
<br>
</div>I guess I don't see how it was a true dependency then. Is this like<br>
an optional dependency? What are the usage cases for this.<br></blockquote><div><br>perhaps i misunderstood what happens in the current implementation. if you have a DAG such as (A,B) (B,E) (A,C) (C,D) and let's say C fails, does the current dag controller continue executing B,E? or does it crash at the first failure. we have the option to go either way in nipype. if something crashes, stop or if something crashes, process all things that are not dependent on the crash.<br>
</div><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">
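to make that concrete, here is a rough sketch (hypothetical names, not the
actual nipype code) of the pruning step: when a node fails, drop it and its
transitive dependents from the pending set and keep running the rest.

    def prune_failed(dag, failed):
        """Return the nodes to skip: the failed node plus everything
        that (transitively) depends on it.

        dag maps each node to the list of nodes it depends on.
        """
        skip = {failed}
        changed = True
        while changed:
            changed = False
            for node, deps in dag.items():
                if node not in skip and skip.intersection(deps):
                    skip.add(node)
                    changed = True
        return skip

    # with the DAG above: if C fails, D is skipped, but A, B, E still run
    dag = {'A': [], 'B': ['A'], 'E': ['B'], 'C': ['A'], 'D': ['C']}
    assert prune_failed(dag, 'C') == {'C', 'D'}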
<div class="im">> * callback support for node: node_started_cb, node_finished_cb<br>
<br>
</div>I am not sure we could support this, because once you create the DAG<br>
and send it to the scheduler, the tasks are out of your local Python<br>
session. IOW, there is really no place to call such callbacks.<br></blockquote><div><br>i'll have to think about this one a little more. one use case for this is reporting where things stand within the execution graph (perhaps the scheduler can report this, although now, i'm back to polling instead of being called back.)<br>
</div><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">
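as a stopgap, the reporting use case could be approximated on the client
side by polling, something like the sketch below. the names are
placeholders; it only assumes each node's result exposes a ready() method,
as most async-result interfaces do:

    import time

    def watch(results, node_finished_cb, interval=5.0):
        """Poll `results` (node -> async result) and fire the callback
        once per node as it completes -- polling standing in for a real
        node_finished_cb hook."""
        pending = set(results)
        while pending:
            for node in sorted(pending):
                if results[node].ready():
                    node_finished_cb(node)
                    pending.discard(node)
            time.sleep(interval)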
> > * support for nodes themselves being DAGs
> ...
>
> I think for the node is a DAG case, we would just flatten that at
> submission time. IOW, apply the transformation:
>
> A DAG of nodes, each of which may be a DAG => A DAG of nodes.
>
> Would this work?

this would work. i think we have a slightly more complicated case of this
implemented in nipype, but perhaps i need to think about it again. our case
is like a maptask, where the same node operates on a list of inputs and we
then collate the outputs back together. but as a general-purpose mechanism,
you should not worry about this use case for now.
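the flattening transformation you describe is easy enough to sketch. this
is just an illustration (hypothetical representation: a dag is
{name: [names it depends on]}, `subdags` maps a composite node to its inner
dag, and names are assumed globally unique), not a proposed implementation:

    def sinks(dag):
        """Nodes of `dag` that nothing else in `dag` depends on."""
        used = {d for deps in dag.values() for d in deps}
        return [n for n in dag if n not in used]

    def flatten(dag, subdags):
        """Splice each composite node's inner dag into the outer dag."""
        flat = {}
        for node, deps in dag.items():
            if node not in subdags:
                flat[node] = list(deps)
            else:
                for inner, inner_deps in subdags[node].items():
                    # inner source nodes inherit the composite node's deps
                    flat[inner] = list(inner_deps) or list(deps)
        for node, deps in flat.items():
            # edges into a composite node become edges into its sinks
            flat[node] = sum((sinks(subdags[d]) if d in subdags else [d]
                              for d in deps), [])
        return flat

    outer = {'A': [], 'M': ['A'], 'Z': ['M']}
    print(flatten(outer, {'M': {'m1': [], 'm2': ['m1']}}))
    # -> {'A': [], 'm1': ['A'], 'm2': ['m1'], 'Z': ['m2']}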
<div class="im">Yes, it does make sense to support DRMAA in ipcluster. Once Min's<br></div>
stuff has been merged into master, we will begin to get it working<br>
with the batch systems again.<br></blockquote><div><br>great.<br><br>cheers,<br><br>satra<br><br></div></div>