[IPython-dev] load on controller node

Brian Granger ellisonbg.net at gmail.com
Wed Aug 29 11:59:53 EDT 2007


Glen,

I think what is going on is the following.  When you issue a pullAll
command and that request reaches the engines, each engine sends its
data back.  Once that data has been sent to the controller, the
engine load goes back down.  But now think about what the controller
has to do:

1) Receive all the data *from every engine*

2) Collate the resulting data from each engine into a single list of
pulled results

3) Send the data back to the RemoteController.

Steps 2 and 3 won't happen until after the engine load goes back down,
so I think that is what you are seeing.  The other aspect that
amplifies this effect is that each engine only handles 1 object,
whereas the controller handles N objects (for N engines).  There is
simply a lot more for the controller to do.

One thing that this shows is that the controller can be a bottleneck
for certain types of algorithms.  Anytime I end up with such a
bottleneck, I try to see if there are ways of moving more of the work
onto the engines and avoiding data movement through the controller.
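For example, continuing from the sketch above and assuming each engine
holds a large list named data, you can reduce on the engines and pull
only the small per-engine results:

    # Pull-and-reduce: the controller has to relay N large objects.
    chunks = rc.pullAll('data')
    total = sum(sum(chunk) for chunk in chunks)

    # Reduce-then-pull: each engine reduces locally, so the controller
    # only has to relay N small numbers.
    rc.executeAll('partial = sum(data)')
    total = sum(rc.pullAll('partial'))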

With that said, I guess we need to submit the final version of the
slides by tomorrow.  Are you on target for that?  It is great that you
got this working with 40 engines.

> So, I guess there is nothing shocking in the graph, although it would be
> interesting to see how things would change if the controller were able
> to use more than 1 CPU.

You can do this right now pretty easily - but I am not sure it is
worth it.  You could start two controllers and have 20 engines connect
to each.  Then in your client code you would simply create 2
RemoteControllers and write the algorithm in terms of those (see the
sketch below).  In the future, we really need to create a
MetaRemoteController object that supports the notion of aggregating
multiple RemoteController objects into a single one.  But even if you
do this, it is possible that the process running the RemoteControllers
will still be a bottleneck.  I am not sure if you have time to explore
all this.
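Roughly, it would look like this (untested; the ports are
illustrative, 'data' is assumed to already be on the engines, and each
controller would have 20 engines attached):

    import ipython1.kernel.api as kernel

    # One RemoteController per controller process.
    rc1 = kernel.RemoteController(('127.0.0.1', 10105))
    rc2 = kernel.RemoteController(('127.0.0.1', 10106))

    # Write the algorithm in terms of both controllers...
    for rc in (rc1, rc2):
        rc.executeAll('partial = sum(data)')

    # ...and aggregate the results by hand, which is exactly the
    # bookkeeping a MetaRemoteController would eventually hide.
    results = rc1.pullAll('partial') + rc2.pullAll('partial')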

Let me know if you need anything else.

Brian

> Best Regards,
> Glen Mabey
>
> BTW, the low time between 25 and 27 is a disk access, and I plan to
> reduce the effect of this by running a tandem set of engines that simply
> pull the data into the NFS cache ...  we'll see how that goes.
>
>
> _______________________________________________
> IPython-dev mailing list
> IPython-dev at scipy.org
> http://lists.ipython.scipy.org/mailman/listinfo/ipython-dev
>
>
>


