[IPython-dev] IPython.parallel slow push

Moritz Beber moritz.beber at gmail.com
Tue Aug 12 06:31:15 EDT 2014


On Tue, Aug 12, 2014 at 12:38 AM, Fernando Perez <fperez.net at gmail.com>
wrote:

> On Mon, Aug 11, 2014 at 6:56 AM, Wes Turner <wes.turner at gmail.com> wrote:
>
>> This [2] seems to suggest that anything that isn't a buffer,
>> str/bytes, or numpy array is pickled and copied.
>>
>
> That is indeed correct.
>
>
>>  Would it be faster to ETL into something like HDF5 (e.g. w/
>> Pandas/PyTables) and just synchronize the dataset URI?
>>
>
> Absolutely.
>
> IPython.parallel is NOT the right tool to use to move large amounts of
> data around between machines. It's an important problem in
> parallel/distributed computing, but also a very challenging one that is
> beyond our scope and resources.
>

As I said, I wasn't moving anything between machines, only locally, but it
still goes through ZMQ, and I understand that IPython is not meant to handle
this situation. Simply writing the data to a shelve (which relies on pickle)
and loading its contents in each kernel already reduced the time needed
considerably.
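For reference, a minimal sketch of that workaround (the file path and the
big_object / payload names are placeholders, not from my actual code); only
the short path string travels over ZMQ, and each engine reads the data from
disk itself:

    import shelve
    from IPython.parallel import Client

    # Write the large object once on the client side
    # ('/tmp/shared_data' and big_object are placeholders).
    db = shelve.open('/tmp/shared_data')
    db['payload'] = big_object
    db.close()

    rc = Client()
    dview = rc[:]

    # Only the short path string is pushed over ZMQ.
    dview.push({'db_path': '/tmp/shared_data'})

    # Each engine opens the shelve and loads the data locally.
    dview.execute("import shelve\n"
                  "db = shelve.open(db_path)\n"
                  "payload = db['payload']\n"
                  "db.close()")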


>
> When using IPython.parallel, you should think of it as a good way to
>
> - coordinate computation
> - move code around
> - move *small* data around
> - have interactive control in parallel settings
>
> But you should have a non-IPython strategy for moving big chunks of data
> around. The right answer to that question will vary from one context to
> another. In some cases a simple NFS mount may be enough, elsewhere
> something like Hadoop FS or Disco FS may work, or a well-sharded database,
> or whatever.
>
> But it's simply a problem that we consider orthogonal to what
> IPython.parallel can do well.
>
> Hope this helps,
>
> f
>

Thank you for your input.
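P.S. For the archive, the HDF5 route Wes suggested looks much the same; a
rough sketch using pandas (df, path, and dataset key are again placeholders),
where each engine reads the dataset from the shared file rather than
receiving a pickled copy:

    import pandas as pd
    from IPython.parallel import Client

    # Write the frame to HDF5 once (df, path, and key are placeholders).
    df.to_hdf('/tmp/shared.h5', 'dataset', mode='w')

    rc = Client()
    dview = rc[:]
    dview.push({'h5_path': '/tmp/shared.h5'})

    # Each engine loads the dataset from disk instead of over ZMQ.
    dview.execute("import pandas as pd\n"
                  "df = pd.read_hdf(h5_path, 'dataset')")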