[IPython-dev] pyspark and IPython

Nitin Borwankar nborwankar at gmail.com
Thu Aug 29 17:41:07 EDT 2013

Hi Brian,

The advantage IMHO is that pyspark and the larger UCB AMP effort are a huge
open source effort for distributed parallel computing that improves upon
the Hadoop model. Spark, the underlying layer, plus Shark, the Hive-compatible
query language, add performance gains of 10x-100x.  The effort has 20+
companies contributing code, including Yahoo, and 70+ contributors. AMP has a
$10M grant from the NSF.  So:
a) it's not going away soon;
b) it may be hard to compete with it without that level of resources;
c) they do have a Python shell (I have not used it yet) and they appear
committed to having Python as a first-class language in their effort;
d) let's see if we can find ways to integrate with it.
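For readers who haven't seen the pyspark shell mentioned in c), here is a rough sketch of the RDD-style map/reduce pattern it exposes. The `sc.parallelize(...)` calls in the comments are from pyspark's documented API as I understand it; the runnable part below emulates the same pipeline in plain Python so it works without a Spark install - a sketch of the programming model, not of pyspark itself.

```python
# In an actual pyspark session the pipeline would look roughly like:
#   rdd = sc.parallelize(range(10))
#   rdd.map(lambda x: x * x).reduce(lambda a, b: a + b)
# The same dataflow, emulated with plain Python:
from functools import reduce

data = range(10)
squared = map(lambda x: x * x, data)         # analogous to rdd.map(...)
total = reduce(lambda a, b: a + b, squared)  # analogous to rdd.reduce(...)
print(total)  # 285
```

The appeal for IPython integration is that this is already an expression-at-a-time workflow, much like a notebook cell.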

I think integration at the level of the interactive interface might make the most sense.

Just my 2c but I think this effort may leapfrog pure Hadoop over the next
2-3 years.


Nitin Borwankar
nborwankar at gmail.com

On Thu, Aug 29, 2013 at 1:35 PM, Brian Granger <ellisonbg at gmail.com> wrote:

> From a quick glance, it looks like both pyspark and IPython use
> similar parallel computing models in terms of the process model.  You
> might think that would help them to integrate, but in this case I
> think it will get in the way of integration.  Without learning more
> about the low-level details of their architecture it is really
> difficult to know whether it is possible.  But I think the bigger
> question is: what would the motivation for integration be?  Both
> IPython and Spark provide self-contained parallel computing
> capabilities - what use cases are there for using both at the same
> time?  I think the biggest potential showstopper is that pyspark is
> not designed in any way to be interactive, as far as I can tell.
> Pyspark jobs basically run in batch mode, which is going to make it
> really tough to fit into IPython's interactive model.  Worth looking
> into more, though.
> Cheers,
> Brian
> On Thu, Aug 29, 2013 at 11:28 AM, Nitin Borwankar <nborwankar at gmail.com>
> wrote:
> > I'm at AmpCamp3 at UCB and see that there would be huge benefits to
> > integrating pyspark with IPython and IPyNB.
> >
> > Questions:
> >
> > a) has this been attempted/done? If so, pointers please.
> >
> > b) does this overlap the IPyNB parallel computing effort in
> > conflicting/competing ways?
> >
> > c) if this has not been done yet - does anyone have a sense of how much
> > effort this might be? (I've done a small hack integrating the postgres
> > psql shell into ipynb, so I'm not terrified by that level of deep
> > digging, but are there any showstopper gotchas?)
> >
> > Thanks much,
> >
> > Nitin
> > ------------------------------------------------------------------
> > Nitin Borwankar
> > nborwankar at gmail.com
> >
> > _______________________________________________
> > IPython-dev mailing list
> > IPython-dev at scipy.org
> > http://mail.scipy.org/mailman/listinfo/ipython-dev
> >
> --
> Brian E. Granger
> Cal Poly State University, San Luis Obispo
> bgranger at calpoly.edu and ellisonbg at gmail.com