[pypy-dev] [ANN] Python compilers workshop at SciPy this year

John Camara john.m.camara at gmail.com
Thu Mar 24 08:22:33 EDT 2016


Besides JPype and PyJNIus there is also https://www.py4j.org/.  I haven't
heard of JPype being used in any recent projects, so I assume it is
outdated by now.  PyJNIus gets used, but I tend to see it only on Android
projects.  Py4J gets used often in numerical/scientific projects, mainly
due to its use in PySpark.  The problem with all of these libraries is
that they have no way to share large amounts of memory between the JVM
and the Python VM, so large chunks of data have to be copied/serialized
when crossing between the two VMs.
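
To make the bridge concrete, here is a minimal Py4J sketch.  It assumes
a Java process is already running a py4j GatewayServer on the default
port (25333); note that every call crosses a local socket rather than
shared memory:

    from py4j.java_gateway import JavaGateway

    # Connect to a Java process that is already running a py4j
    # GatewayServer on the default port (25333).
    gateway = JavaGateway()

    # Create a JVM object and call it from Python.  Every argument and
    # return value is serialized over a local socket -- nothing is
    # shared in memory between the two processes.
    random = gateway.jvm.java.util.Random()
    print(random.nextInt(100))

That per-call serialization is fine for small values but is exactly what
hurts once the values are large arrays.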

Spark is the de facto standard in cluster computing at this point in
time.  At a high level, Spark executes code distributed throughout a
cluster so that the code runs as close as possible to where the data
lives, minimizing the transfer of large amounts of data.  The work to be
executed is packaged up into units called Resilient Distributed Datasets
(RDDs).  RDDs are lazily evaluated and are essentially graphs of the
operations to be performed on the data.  They are capable of reading
data from many types of sources, outputting to many types of sinks, and
containing the code to be executed, and they are also responsible for
caching or keeping results in memory for future RDDs that may be
executed.
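
As a rough sketch of what that looks like from Python (a toy word count
against a local file, just for illustration):

    from pyspark import SparkContext

    sc = SparkContext("local[2]", "rdd-sketch")

    # Transformations only build up the graph of operations; nothing
    # executes yet.
    words = sc.textFile("input.txt").flatMap(lambda line: line.split())
    counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)

    # Ask Spark to keep the result in memory for future RDDs that
    # build on it.
    counts.cache()

    # Only an action like this one triggers evaluation of the graph.
    counts.saveAsTextFile("counts-output")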

If you write all your code in Java or Scala, its execution will be
performed in JVMs distributed across the cluster.  On the other hand,
Spark does not limit itself to JVM languages, so Python can be used.  In
the case of Python, the PySpark library is used.  With PySpark you
define RDDs from Python that are then executed under the JVM.  In this
scenario, only if required will the final results of the calculations
end up being passed to Python.  I say only if required because the end
results may simply be left in memory, or written out as, say, an HDFS
file in Hadoop, and never need to be transferred to Python.  Under this
scenario the code is written in Python but effectively all the "real"
work is performed under the JVM.
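
The DataFrame API makes this easy to see; in the sketch below (reusing
sc from above, with made-up paths and column names) Python only
assembles the query plan, and the actual filtering, grouping, and output
all happen inside the JVM:

    from pyspark.sql import SQLContext

    sqlContext = SQLContext(sc)
    df = sqlContext.read.json("hdfs://cluster/events/")  # made-up path

    # The filter, grouping, and count execute in the JVM executors;
    # Python only describes the plan.
    errors = df.filter(df["status"] == "error").groupBy("service").count()

    # Writing straight back out to HDFS means no row data ever has to
    # be transferred to the Python process.
    errors.write.parquet("hdfs://cluster/error-counts/")  # made-up path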

Often someone writing Python will also want to perform some of the
operations in Python itself.  This is possible because the RDDs you
create can contain operations that run under the JVM as well as
operations that run under Python (and of course other languages are
supported).  When Python is involved, Spark starts up Python VMs on the
required nodes so that the Python portions of the work can be performed.
The Python VMs can be CPython, PyPy, or even a mix of the two.  The
downside to using non-Java languages is the overhead of passing data
between the JVM and the Python VM: the memory is not shared between the
processes but instead copied/serialized between them.
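
For example, a single Python lambda in the pipeline is enough to pull
every record across the process boundary (made-up path again):

    lines = sc.textFile("hdfs://cluster/logs/")

    # The lambda runs in Python worker processes that Spark launches
    # next to each JVM executor; every record is pickled across the
    # process boundary, and the per-partition results are pickled back.
    lengths = lines.map(lambda line: len(line))
    print(lengths.sum())

Setting the PYSPARK_PYTHON environment variable is how you point those
worker processes at CPython or PyPy.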

Because this data is copied between the two VMs, anyone who writes
Python code for this environment always has to be conscious of the data
crossing between the processes, so that the extra overhead does not
become a large burden.  Quite often the goal will be to first perform
the bulk of the operations under the JVM, so that hopefully only a
smaller subset of the data has to be processed under Python.  If that
can be done, the overhead is minimized and there are essentially no
downsides to using Python in the pipeline of operations.
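
A hedged sketch of that pattern, again leaning on the DataFrame API so
the bulk filtering stays in the JVM (paths and column names are made
up):

    df = sqlContext.read.parquet("hdfs://cluster/samples/")

    # The heavy filtering runs in the JVM; no rows cross to Python yet.
    interesting = df.filter(df["score"] > 0.99)

    # Only the small surviving subset is serialized over to the Python
    # workers for the step that genuinely needs Python.
    results = interesting.rdd.map(lambda row: row["score"] ** 0.5).collect()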

If you're unfortunate and need to perform some of the processing early
in the pipeline under Python, or worse yet need to go back and forth
many times between Python and Java, the overhead of copying huge amounts
of data can significantly slow things down, which essentially puts
Python at a disadvantage to Java.

If it were possible to change the model of execution so that the Python
VM could be embedded in the JVM (or vice versa), with memory shared
between the two VMs, this downside of using Python would be eliminated,
or at the very least minimized to the point where it is no longer an
issue.  Thus the need for a jffi library.

There is a strong desire by many to use dynamic languages in these
clustered environments, and Python is likely in the best position to
become the language of choice, due to its ability to work with C-based
libraries and of course its syntax.  The issues that hold Python back at
this point are the serialization overhead, the not-so-great state of
packaging, and the inability to have both the speed of a JIT and
complete access to the numpy/scipy ecosystem.

Luckily for Python, at this point there is no other dynamic language
that is a clear winner.  But if too much time passes before these issues
are solved, I'm sure another language will step up to the plate.  At
this point my expectation is that Node could make a move.  It already
has the speed thanks to the JavaScript JITs, it already has a great
story for packaging and deployment, and its growth is exploding on the
server side due to all the money being poured into it.  What it strongly
lacks today is the connection to C/legacy code and to
numerical/scientific modules, and of course it does not have a solution
to the data-copying overhead with the JVM either.

Anyway, this is just my 2 cents on what is currently holding Python back
from taking off in this space.

On Thu, Mar 24, 2016 at 2:32 AM, Hakan Ardo <hakan.ardo at gmail.com> wrote:

>
> On Mar 23, 2016 21:49, "Armin Rigo" <arigo at tunes.org> wrote:
> >
> > Hi John,
> >
> > On 23 March 2016 at 19:16, John Camara <john.m.camara at gmail.com> wrote:
> > > I would like to suggest one more topic for the workshop. I see a big
> need
> > > for a library (jffi) similar to cffi but that provides a bridge to Java
> > > instead of C code. The ability to seamlessly work with native Java
> data/code
> > > would offer a huge improvement (...)
> >
> > Isn't it what JPype does?  Can you describe how it isn't suitable for
> > your needs?
>
> There is also PyJNIus:
>
>     https://pyjnius.readthedocs.org/en/latest/
>

