[pypy-dev] [ANN] Python compilers workshop at SciPy this year

David Edelsohn dje.gcc at gmail.com
Thu Mar 24 12:31:46 EDT 2016


Maciej,

How about a little more useful response, along the lines of "we'll
help you find the right audience for this discussion and collaborate
with you to make the case"?

- David

On Thu, Mar 24, 2016 at 11:32 AM, Maciej Fijalkowski <fijall at gmail.com> wrote:
> Ok fine, but we're not the recipients of such a message.
>
> Please lobby PSF for having a JIT, we all support that :-)
>
> On Thu, Mar 24, 2016 at 5:23 PM, John Camara <john.m.camara at gmail.com> wrote:
>> Hi Fijal,
>>
>> I understand where you're coming from, and I'm not trying to convince you
>> to work on it.  I'm mainly trying to point out a need that may not be
>> obvious to this community.  I don't spend much time on big data and
>> analytics, so I don't have a lot of time to devote to this task.  That
>> could change in the future, so you never know; I may end up getting
>> involved with this.
>>
>> At the end of the day I think it is the PSF that needs to do an honest
>> assessment of the current state of Python and of programming in general,
>> so that they can help direct the future of Python.  I think with an honest
>> assessment it should be clear that it is absolutely necessary for a
>> dynamic language to have a JIT.  Otherwise, a language like Node would not
>> be growing so quickly on the server side.  An honest assessment would
>> conclude that Python needs to play a major role in big data and analytics,
>> as we don't want this to be another area where Python misses the boat.
>> Like every language other than JavaScript, we missed playing an important
>> role on the web front end.  More recently we missed out on mobile.  I
>> don't think it is good for us to miss out on big data.  It would be a
>> shame, since we had such a strong scientific community, which initially
>> gave us a huge advantage over other communities.  Missing out on big data
>> might also be the driver that moves the scientific community in a
>> different direction, which would be a big loss to Python.
>>
>> I personally don't see any particular companies or industries that are
>> willing to fund the tasks needed to solve these issues.  That's not to say
>> there are no funds for Python projects; it's just that no one company is
>> likely to be willing to fund these kinds of projects on its own.  It
>> really needs the PSF to coordinate these efforts, but they seem to be more
>> focused on trying to make Python 3 a success than on improving the overall
>> health of the community.
>>
>> I believe that Python is in pretty good shape to solve these issues; it
>> just needs some funding and focus to get there.
>>
>> Hopefully the workshop will be successful and help create some focus.
>>
>> John
>>
>> On Thu, Mar 24, 2016 at 8:56 AM, Maciej Fijalkowski <fijall at gmail.com>
>> wrote:
>>>
>>> Hi John
>>>
>>> Thanks for explaining the current situation of the ecosystem. I'm not
>>> quite sure what your intention is. PyPy (and CPython) is very easy to
>>> embed through any C-level API, especially with the latest additions to
>>> cffi embedding. If someone feels like doing the work to share stuff
>>> that way (as I presume a lot of the data living in the JVM can be
>>> represented as a pointer plus a shape describing how to access it),
>>> then he's obviously more than free to do so; I'm even willing to help
>>> with that. Now this seems like a medium-to-big project that will
>>> additionally require quite a bit of community will to endorse. Are you
>>> willing to volunteer to work on such a project and dedicate a lot of
>>> time to it? If not, then there is no way you can convince us to
>>> volunteer our own time to do it - it's just too big and quite far out
>>> of our usual areas of interest. If there is some commercial interest
>>> (and I think there might be) in pushing Python, and especially PyPy,
>>> further in that area, we might want to have a better story for numpy
>>> first, but then feel free to send those corporate interest people my
>>> way; we can maybe organize something. If, however, you want us to do
>>> community service to push Python solutions in an area I have very
>>> little clue about, I would like to politely decline.
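>>>
>>> For example, here is a rough, untested sketch of what the cffi embedding
>>> side could look like (the function name and the pointer-plus-length
>>> layout are purely illustrative):
>>>
>>>     import cffi
>>>
>>>     ffibuilder = cffi.FFI()
>>>
>>>     # C-level entry point that the host process (e.g. a JVM via JNI/JNA)
>>>     # would call on a buffer it owns.
>>>     ffibuilder.embedding_api("""
>>>         double sum_doubles(double *data, int length);
>>>     """)
>>>
>>>     ffibuilder.set_source("_jvm_bridge", "")
>>>
>>>     # Python code that runs inside the embedded CPython/PyPy once the
>>>     # resulting shared library is loaded.
>>>     ffibuilder.embedding_init_code("""
>>>         from _jvm_bridge import ffi
>>>
>>>         @ffi.def_extern()
>>>         def sum_doubles(data, length):
>>>             # 'data' is a raw pointer into memory owned by the host
>>>             # process - nothing is copied here.
>>>             return sum(data[i] for i in range(length))
>>>     """)
>>>
>>>     ffibuilder.compile(target="libjvm_bridge.*", verbose=True)
>>>
>>> The JVM side could then load the resulting shared library and hand it
>>> the address of an off-heap buffer.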
>>>
>>> Cheers,
>>> fijal
>>>
>>> On Thu, Mar 24, 2016 at 2:22 PM, John Camara <john.m.camara at gmail.com>
>>> wrote:
>>> > Besides JPype and PyJNIus there is also https://www.py4j.org/.  I
>>> > haven't heard of JPype being used in any recent projects, so I assume it
>>> > is outdated by now.  PyJNIus gets used, but I tend to only see it used
>>> > on Android projects.  The Py4J project gets used often in
>>> > numerical/scientific projects, mainly due to its use in PySpark.  The
>>> > problem with all these libraries is that they don't have a way to share
>>> > large amounts of memory between the JVM and Python VMs, so large chunks
>>> > of data have to be copied/serialized when going between the two VMs.
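>>> >
>>> > To make the copying concrete, a minimal Py4J session looks roughly like
>>> > this (it assumes a JVM with a py4j GatewayServer already running on the
>>> > default port):
>>> >
>>> >     from py4j.java_gateway import JavaGateway
>>> >
>>> >     gateway = JavaGateway()                  # connects to the GatewayServer over a local socket
>>> >     random = gateway.jvm.java.util.Random()  # proxy for an object living in the JVM
>>> >
>>> >     # Each call below is a socket round trip, and each returned value is
>>> >     # serialized and copied into this Python process - nothing is shared
>>> >     # in place.
>>> >     values = [random.nextDouble() for _ in range(10000)]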
>>> >
>>> > Spark is the de facto standard in cluster computing at this point in
>>> > time.  At a high level, Spark executes code that is distributed
>>> > throughout a cluster so that the code being executed is as close as
>>> > possible to where the data lives, so as to minimize transferring large
>>> > amounts of data.  The code that needs to be executed is packaged up into
>>> > units called Resilient Distributed Datasets (RDDs).  RDDs are lazily
>>> > evaluated and are essentially graphs of the operations that need to be
>>> > performed on the data.  They are capable of reading data from many types
>>> > of sources, outputting to multiple types of sources, containing the code
>>> > that needs to be executed, and are also responsible for caching or
>>> > keeping results in memory for future RDDs that may be executed.
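>>> >
>>> > For example, a small PySpark pipeline looks roughly like this (untested;
>>> > the HDFS path is made up):
>>> >
>>> >     from pyspark import SparkContext
>>> >
>>> >     sc = SparkContext(appName="rdd-sketch")
>>> >
>>> >     lines = sc.textFile("hdfs:///logs/access.log")        # nothing is read yet
>>> >     errors = lines.filter(lambda line: "ERROR" in line)   # transformations only extend the graph
>>> >     counts = errors.map(lambda line: (line.split()[0], 1)) \
>>> >                    .reduceByKey(lambda a, b: a + b)
>>> >
>>> >     print(counts.take(10))   # only this action triggers the distributed execution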
>>> >
>>> > If you write all your code in Java or Scala, its execution will be
>>> > performed in JVMs distributed across the cluster.  On the other hand,
>>> > Spark does not limit its use to JVM-based languages, so Python can be
>>> > used.  In the case of Python, the PySpark library is used to define the
>>> > RDDs that will be executed under the JVM.  In this scenario the final
>>> > results of the calculations are only passed to Python if required.  I
>>> > say only if required because it's possible the end results are simply
>>> > left in memory, or are used to create an output such as an HDFS file in
>>> > Hadoop, and never need to be transferred to Python.  Under this scenario
>>> > the code is written in Python, but effectively all the "real" work is
>>> > performed under the JVM.
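>>> >
>>> > As a sketch of that distinction (again untested, with made-up paths):
>>> >
>>> >     from pyspark import SparkContext
>>> >
>>> >     sc = SparkContext(appName="where-results-land")
>>> >
>>> >     counts = (sc.textFile("hdfs:///books/corpus.txt")
>>> >                 .flatMap(lambda line: line.split())
>>> >                 .map(lambda word: (word, 1))
>>> >                 .reduceByKey(lambda a, b: a + b))
>>> >
>>> >     # Write the result straight back to HDFS: nothing comes back to this
>>> >     # Python driver process.
>>> >     counts.saveAsTextFile("hdfs:///out/word-counts")
>>> >
>>> >     # Only an explicit action like this one serializes every element into
>>> >     # the Python process.
>>> >     local = counts.collect()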
>>> >
>>> > Often someone writing Python is also going to want to perform some of
>>> > the operations under Python.  This can be done, as the RDDs that are
>>> > created can contain operations that get performed under the JVM as well
>>> > as under Python (and of course other languages are supported).  When
>>> > Python is involved, Spark will start up Python VMs on the required nodes
>>> > so that the Python portions of the work can be performed.  The Python
>>> > VMs can be CPython, PyPy, or even a mix of both.  The downside to using
>>> > non-Java languages is the overhead of passing data between the JVM and
>>> > the Python VM, as the memory is not shared between the processes but
>>> > instead copied/serialized between them.
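>>> >
>>> > Roughly, the moment a Python function appears in the pipeline, Spark
>>> > launches Python workers next to the executors and streams records
>>> > through them, something like this (untested; whether the workers run
>>> > CPython or PyPy is normally chosen via the PYSPARK_PYTHON environment
>>> > variable before the job is launched):
>>> >
>>> >     from pyspark import SparkContext
>>> >
>>> >     sc = SparkContext(appName="mixed-pipeline")
>>> >
>>> >     nums = sc.textFile("hdfs:///data/numbers.txt")   # made-up path, one integer per line
>>> >
>>> >     # The lambda runs in Python worker processes on the executor nodes.
>>> >     # Every record is serialized in the JVM, piped to the Python worker,
>>> >     # processed, and the result piped back.
>>> >     squares = nums.map(lambda s: int(s) ** 2)
>>> >
>>> >     print(squares.sum())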
>>> >
>>> > Because this data is copied between the two VMs, anyone who writes
>>> > Python code for this environment always has to be conscious of the data
>>> > being copied between the processes, so that the extra overhead does not
>>> > become a large burden.  Quite often the goal will be to first perform
>>> > the bulk of the operations under the JVM, so that hopefully only a
>>> > smaller subset of the data has to be processed under Python.  If this
>>> > can be done, then the overhead can be minimized and there are
>>> > essentially no downsides to using Python in the pipeline of operations.
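>>> >
>>> > One common way to arrange that is to express the bulk filtering with the
>>> > DataFrame API, which is planned and executed on the JVM side, and only
>>> > attach Python functions to the (hopefully much smaller) result.  A rough
>>> > sketch (untested; the column names and paths are made up):
>>> >
>>> >     from pyspark import SparkContext
>>> >     from pyspark.sql import SQLContext
>>> >
>>> >     sc = SparkContext(appName="filter-in-the-jvm")
>>> >     sqlContext = SQLContext(sc)
>>> >
>>> >     events = sqlContext.read.parquet("hdfs:///data/events")
>>> >
>>> >     # The filter is a column expression, so it runs entirely in the JVM;
>>> >     # no rows cross the process boundary here.
>>> >     errors = events.filter(events["status"] >= 500)
>>> >
>>> >     # Only the surviving rows are serialized into the Python workers once
>>> >     # a Python function is attached.
>>> >     summaries = errors.rdd.map(lambda row: (row["host"], row["bytes"]))
>>> >     print(summaries.take(5))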
>>> >
>>> > If you're unfortunate and need to perform some of the processing early
>>> > in the pipeline under Python, and worse yet if there is a need to go
>>> > back and forth many times between Python and Java, the overhead of
>>> > copying huge amounts of data can significantly slow things down, which
>>> > essentially puts Python at a disadvantage to Java.
>>> >
>>> > If it were possible to change the execution model so that the Python VM
>>> > could be embedded in the JVM, or vice versa, and the memory could be
>>> > shared between the two VMs, this downside of using Python would be
>>> > eliminated, or at the very least minimized to the point where it is no
>>> > longer an issue.  Hence the need for a jffi library.
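>>> >
>>> > To be clear about what I mean, here is a purely hypothetical sketch; no
>>> > such library exists today, and every name in it is invented.  The shape
>>> > simply mirrors cffi, with Java declarations instead of C ones and
>>> > zero-copy views instead of serialized copies:
>>> >
>>> >     import jffi                        # hypothetical module
>>> >
>>> >     ffi = jffi.FFI()
>>> >     ffi.jdef("""
>>> >         class DoubleColumn {           // hypothetical Java-side class
>>> >             int    length();
>>> >             double get(int i);
>>> >         }
>>> >     """)
>>> >
>>> >     jvm = ffi.attach()                 # hypothetical: attach to the JVM hosting this process
>>> >
>>> >     col = jvm.lookup("sharedColumn")   # hypothetical: a view over JVM-owned memory, not a copy
>>> >     total = sum(col.get(i) for i in range(col.length()))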
>>> >
>>> > There is a strong desire by many to use dynamic languages in these
>>> > clustered environments, and Python is likely in the best position to
>>> > become the language of choice due to its ability to work with C-based
>>> > libraries and, of course, its syntax.  The issues that hold Python back
>>> > at this point are the serialization overhead, the not-so-great state of
>>> > packaging, and not having both the speed of a JIT and complete access to
>>> > the numpy/scipy ecosystem.
>>> >
>>> > Luckily for Python, there is no other dynamic language that is a clear
>>> > winner today.  But if too much time passes before these issues are
>>> > solved, I'm sure another language will step up to the plate.  At this
>>> > point my expectation is that Node could likely make a move.  It already
>>> > has the speed thanks to the JavaScript JITs, it already has a great
>>> > story for packaging and deployment, and its growth is exploding on the
>>> > server side due to all the money being poured into it.  What it strongly
>>> > lacks today is the connection to C/legacy code and numerical/scientific
>>> > modules, and of course it does not have a solution to the data-copying
>>> > overhead it also has with the JVM.
>>> >
>>> > Anyway, this is just my 2 cents on what is currently holding Python
>>> > back from taking off in this space.
>>> >
>>> > On Thu, Mar 24, 2016 at 2:32 AM, Hakan Ardo <hakan.ardo at gmail.com>
>>> > wrote:
>>> >>
>>> >>
>>> >> On Mar 23, 2016 21:49, "Armin Rigo" <arigo at tunes.org> wrote:
>>> >> >
>>> >> > Hi John,
>>> >> >
>>> >> > On 23 March 2016 at 19:16, John Camara <john.m.camara at gmail.com>
>>> >> > wrote:
>>> >> > > I would like to suggest one more topic for the workshop. I see a
>>> >> > > big need for a library (jffi) similar to cffi but that provides a
>>> >> > > bridge to Java instead of C code. The ability to seamlessly work
>>> >> > > with native Java data/code would offer a huge improvement (...)
>>> >> >
>>> >> > Isn't that what JPype does?  Can you describe how it isn't suitable for
>>> >> > your needs?
>>> >>
>>> >> There is also PyJNIus:
>>> >>
>>> >>     https://pyjnius.readthedocs.org/en/latest/
>>> >
>>> >
>>> >
>>> > _______________________________________________
>>> > pypy-dev mailing list
>>> > pypy-dev at python.org
>>> > https://mail.python.org/mailman/listinfo/pypy-dev
>>> >
>>
>>
> _______________________________________________
> pypy-dev mailing list
> pypy-dev at python.org
> https://mail.python.org/mailman/listinfo/pypy-dev

