[IPython-dev] pushAll and executeAll speed issue

Fernando Perez fperez.net at gmail.com
Wed Nov 8 23:29:09 EST 2006


Hey Albert,

Brian will probably pitch in later with more comments, since he knows
the underbelly of this code better than I do.  But in the meantime...

On 11/8/06, Albert Strasheim <fullung at gmail.com> wrote:
> Hello all
>
> I'm doing some experiments with IPython1 on Windows. I'm using DeinoMPI's
> GUI on top of mpi4py to start two kernels on my Core 2 Duo system. They
> connect to the controller without problems. I make a RemoteController and
> off I go. This is too cool.
>
> (By the way: as I understand it, MPI is currently only being used to start
> the kernels. From what I can see, it is not yet being used inside things
> like pushAll, etc. Is the next step to use the MPI functions when they
> are available? Is anybody working on this? If not, maybe I can try my
> hand at it. Any hints or suggestions on where to start would be appreciated.)

There is MPI code in there already, but currently it is not being used
for many core things.  We should intelligently take advantage of it
when present, while gracefully falling back to pickle when MPI is not
there.
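
Just to make that idea concrete, the kind of dispatch I have in mind
would look roughly like this (only a sketch, assuming a reasonably
recent mpi4py; none of these names are actual IPython1 code, and
send_over_network() is a made-up stand-in for the existing Twisted
path):

import cPickle
try:
    from mpi4py import MPI
    _comm = MPI.COMM_WORLD
except ImportError:
    _comm = None

def push(obj, engine_rank):
    # Ship obj to one engine, preferring MPI when it is importable.
    if _comm is not None:
        # mpi4py pickles the object itself and moves the bytes via MPI
        _comm.send(obj, dest=engine_rank, tag=0)
    else:
        # graceful fallback: explicit pickle handed to the network layer
        data = cPickle.dumps(obj, 2)
        send_over_network(data, engine_rank)  # hypothetical Twisted-side call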

> Anyway, back to my experiment. The script I'm using is attached.
>
> It starts off with some local benchmarks running on a single core. I get:
>
> cPickle dumps time 4.95649055955
> cPickle loads time 2.39126201794
> local map time 10.3841574837
>
> Then I push to all and time it. Here I get:
>
> total pushAll time 70.6287321433
>
> I was expecting something proportional to dumpsTime+N*loadsTime (if I
> understand correctly that it is serially pushing out the data to each of the
> N kernels) or maybe even less. I noticed that this section gets dramatically
> slower even with a relatively small increase in data size. While this is
> busy, I see Python using up all the CPU on one core as expected, but I also
> see that the Windows "System" process is very busy.
>
> Next up, the timing over the executeAll call that does the map:
>
> executeAll time for map 139.186242386
>
> but the kernels print:
>
> engine map time 12.8248084603
> engine map time 11.3900467797
>
> which is only slightly more than the single core time. However, I have to
> wonder where the heck the rest of the time went?! :-)
>
> Do you guys see similar timings on Linux? Any thoughts on where my time is
> going? Could it be Twisted? I tried with the latest from SVN and the
> official 2.4.0 release for Windows.

First, a numpy aside.  Note that I had to change the original line

x = N.arange...

to x = range(), because otherwise I was getting zillions of

Warning: overflow encountered in long_scalars

It seems to come from the fact that numpy long scalars don't promote
to Python longs on overflow, the way plain ints do:

In [7]: x[-1]
Out[7]: 99999

In [8]: x[-1]*x[-1]
Warning: overflow encountered in long_scalars
Out[8]: 1409865409

while the answer should be:

In [22]: y=99999

In [23]: y*y
Out[23]: 9999800001L

But that's for the numpy list :)
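
If you do want to keep numpy in the script, two easy workarounds (just
a sketch, untested on your Windows/numpy combination) are to widen the
dtype up front or to coerce to a Python int before multiplying:

import numpy as N

x = N.arange(100000, dtype=N.int64)   # 64-bit elements, so 99999**2 fits
print x[-1] * x[-1]                   # prints 9999800001

y = N.arange(100000)                  # default platform integer dtype
print int(y[-1]) * int(y[-1])         # Python ints promote on overflow: 9999800001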


Here's what is most likely going on.  The pickled form of that
enormous list is very large:

In [24]: x = range(10000000)

In [25]: len(cPickle.dumps(x))
Out[25]: 98888896

That's roughly a 100MB string that needs to be shipped over the wire.  At
this point, we are probably being hurt by Twisted, which makes
unnecessary copies of very large objects when transmitting them.  I
know Brian and Min spent some time looking at this, but I'm not sure
how far they got; I think it wasn't too easy to fix directly, though
it /will/ need to be done eventually.
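
To get a rough feel for what each avoidable copy of a payload that
size costs on its own (a quick, hypothetical micro-benchmark, not the
actual Twisted code path), something like this is enough:

import time

payload = 'x' * (100 * 1024 * 1024)   # stand-in for the ~100MB pickle

t = time.time()
extra = payload[1:]                   # slicing at a nonzero offset forces a full copy
print "one extra copy of ~100MB: %.2f s" % (time.time() - t)

And each such copy is also another ~100MB allocation, which is exactly
the kind of thing that shows up as kernel time and memory pressure.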

I see two possibilities:

a) Finding out how to make Twisted-based sending of pickled strings as
efficient as possible, with minimal copies.

b) Taking better advantage of MPI if available, so that the string
buffer is sent using MPI directly.
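
For (b), the mpi4py side could look something like the sketch below
(written against a reasonably recent mpi4py; the function names are
made up, and the length handshake is just the simplest thing that
could work):

from mpi4py import MPI
import cPickle
import numpy as N

comm = MPI.COMM_WORLD

def send_pickled(obj, dest):
    # pickle once, then hand the raw byte buffer straight to MPI
    data = cPickle.dumps(obj, 2)
    comm.send(len(data), dest=dest, tag=0)         # tell the receiver the size
    comm.Send([data, MPI.BYTE], dest=dest, tag=1)  # buffer-level send, no extra Python copy

def recv_pickled(source):
    nbytes = comm.recv(source=source, tag=0)
    buf = N.empty(nbytes, dtype=N.uint8)           # preallocated receive buffer
    comm.Recv([buf, MPI.BYTE], source=source, tag=1)
    return cPickle.loads(buf.tostring())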

I see fairly similar results to yours:

In [10]: run frags9.py
('127.0.0.1', 10105)
Connecting to controller:  ('127.0.0.1', 10105)
[0, 1]
cPickle dumps time 5.15
cPickle loads time 5.92
local map time 7.06
[127.0.0.1:0] In [29]: import time
[127.0.0.1:1] In [22]: import time
pushing to all...
total pushAll time 1.24
[127.0.0.1:0] In [30]: t = time.clock()
[127.0.0.1:1] In [23]: t = time.clock()
executing on all...
[127.0.0.1:0] In [31]: z = map(lambda y:y*y, x)
[127.0.0.1:1] In [24]: z = map(lambda y:y*y, x)
executeAll time for map 0.0
[127.0.0.1:0] In [32]: t = time.clock() - t
[127.0.0.1:1] In [25]: t = time.clock() - t
total time for push and execute 1.24
[127.0.0.1:0] In [33]: print "engine map time", t
[127.0.0.1:0] Out[33]: engine map time 7.16
[127.0.0.1:1] In [26]: print "engine map time", t
[127.0.0.1:1] Out[26]: engine map time 7.19
(0, 33, 'print "engine map time", t', 'engine map time 7.16\n', '')
(1, 26, 'print "engine map time", t', 'engine map time 7.19\n', '')

I also saw (under Linux) an enormous amount of CPU usage from the
kernel; that's typical of heavy memory allocation.  Further, by the
end of the runs my 2GB box was almost out of RAM.


For read-only data on multicore machines, it would also be nice to
expose a shared-memory interface, but I'm not really up to speed on
the details, nor on whether Python exposes an API that would let this
be done portably.
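
The closest thing I know of that works with today's tools, at least
for numpy arrays on a single machine, is a memory-mapped file.  Just a
sketch of the idea (file name made up, and this is not wired into
IPython1 at all):

import numpy as N

# the writer (e.g. the client) dumps the read-only data to disk once
x = N.arange(10000000, dtype=N.int64)
x.tofile('shared_x.dat')

# each engine on the same box then maps the file read-only instead of
# receiving its own ~100MB copy over the wire; the OS shares the pages
shared = N.memmap('shared_x.dat', dtype=N.int64, mode='r', shape=(10000000,))
z = shared * shared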

Basically, I think that we're starting to see where the performance
bottlenecks are.  This is pre-alpha, after all ;)

We'll keep working on it...

Cheers,

f


