[pypy-dev] [ANN] Python compilers workshop at SciPy this year
Maciej Fijalkowski
fijall at gmail.com
Thu Mar 24 15:48:48 EDT 2016
Hi David,
I'm sorry, it was not supposed to come across as rude.
It seems that the blocker here is full numpy support, which we're
working on right now. We can come back to that discussion once that's
ready.
On Thu, Mar 24, 2016 at 6:31 PM, David Edelsohn <dje.gcc at gmail.com> wrote:
> Maciej,
>
> How about a little more useful response of "we'll help you find the
> right audience for this discussion and collaborate with you to make
> the case."?
>
> - David
>
> On Thu, Mar 24, 2016 at 11:32 AM, Maciej Fijalkowski <fijall at gmail.com> wrote:
>> Ok fine, but we're not the recipients of such a message.
>>
>> Please lobby PSF for having a JIT, we all support that :-)
>>
>> On Thu, Mar 24, 2016 at 5:23 PM, John Camara <john.m.camara at gmail.com> wrote:
>>> Hi Fijal,
>>>
>>> I understand where you're coming from, and I'm not trying to convince you
>>> to work on it. I'm mainly trying to point out a need that may not be
>>> obvious to this community. I don't spend much time on big data and
>>> analytics, so I don't have a lot of time to devote to this task. That could
>>> change in the future, so you never know, I may end up getting involved.
>>>
>>> At the end of the day I think it is the PSF that needs to do an honest
>>> assessment of the current state of Python and of programming in general, so
>>> that they can help direct the future of Python. I think with an honest
>>> assessment it should be clear that it is absolutely necessary that a dynamic
>>> language have a JIT. Otherwise, a language like Node would not be growing so
>>> quickly on the server side. An honest assessment would conclude that Python
>>> needs to play a major role in big data and analytics, as we don't want this
>>> to be another area where Python misses the boat. Like every language other
>>> than JavaScript, we missed playing an important role on the web front end.
>>> More recently we missed out on mobile. I don't think it is good for us to
>>> miss out on big data. It would be a shame, since we had such a strong
>>> scientific community, which initially gave us a huge advantage over other
>>> communities. Missing out on big data might also be the driver that moves
>>> the scientific community in a different direction, which would be a big loss
>>> to Python.
>>>
>>> I personally don't see any particular companies or industries that are
>>> willing to fund the tasks needed to solve these issues. That's not to say
>>> there are no more funds for Python projects; it's just likely that no one
>>> company will be willing to fund these kinds of projects on its own. It
>>> really needs the PSF to coordinate these efforts, but they seem to be more
>>> focused on trying to make Python 3 a success than on improving the overall
>>> health of the community.
>>>
>>> I believe that Python is in pretty good shape in being able to solve these
>>> issues but it just needs some funding and focus to get there.
>>>
>>> Hopefully the workshop will be successful and help create some focus.
>>>
>>> John
>>>
>>> On Thu, Mar 24, 2016 at 8:56 AM, Maciej Fijalkowski <fijall at gmail.com>
>>> wrote:
>>>>
>>>> Hi John
>>>>
>>>> Thanks for explaining the current situation of the ecosystem. I'm not
>>>> quite sure what your intention is. PyPy (and CPython) is very easy to
>>>> embed through any C-level API, especially with the latest additions to
>>>> cffi embedding. If someone feels like doing the work to share stuff
>>>> that way (as I presume a lot of the data held in the JVM can be
>>>> represented as a pointer plus a shape describing how to access it), then
>>>> they're obviously more than free to do so, and I'm even willing to help.
>>>> Now this seems like a medium-to-big size project that additionally
>>>> will require quite a bit of community will to endorse. Are you willing
>>>> to volunteer to work on such a project and dedicate a lot of time to
>>>> it? If not, then there is no way you can convince us to volunteer our
>>>> own time to do it - it's just too big and quite far outside our
>>>> usual areas of interest. If there is some commercial interest (and I
>>>> think there might be) in pushing python and especially pypy further in
>>>> that area, we might want to have a better story for numpy first, but
>>>> then feel free to send those corporate interest people my way, we can
>>>> maybe organize something. If you want us to do community service to
>>>> push Python solutions in an area I have very little clue about,
>>>> however, I would like to politely decline.
>>>>
>>>> Cheers,
>>>> fijal
>>>>
>>>> On Thu, Mar 24, 2016 at 2:22 PM, John Camara <john.m.camara at gmail.com>
>>>> wrote:
>>>> > Besides JPype and PyJNIus there is also https://www.py4j.org/. I haven't
>>>> > heard of JPype being used in any recent projects, so I am assuming it is
>>>> > outdated by now. PyJNIus gets used, but I tend to only see it used on
>>>> > Android projects. The Py4J project gets used often in
>>>> > numerical/scientific projects, mainly due to its use in PySpark. The
>>>> > problem with all these libraries is that they don't have a way to share
>>>> > large amounts of memory between the JVM and Python VMs, and so large
>>>> > chunks of data have to be copied/serialized when going between the 2 VMs.
>>>> >
>>>> > Spark is the de facto standard in cluster computing at this point in
>>>> > time. At a high level, Spark executes code that is distributed
>>>> > throughout a cluster so that the code being executed is as close as
>>>> > possible to where the data lives, so as to minimize transferring of
>>>> > large amounts of data. The code that needs to be executed is packaged up
>>>> > into units called Resilient Distributed Datasets (RDDs). RDDs are lazily
>>>> > evaluated and are essentially graphs of the operations that need to be
>>>> > performed on the data. They are capable of reading data from many types
>>>> > of sources, outputting to multiple types of destinations, containing the
>>>> > code that needs to be executed, and are also responsible for caching or
>>>> > keeping results in memory for future RDDs that may be executed.
>>>> >
>>>> > If you write all your code in Java or Scala, its execution will be
>>>> > performed in JVMs distributed in the cluster. On the other hand, Spark
>>>> > does not limit its use to only Java-based languages, so Python can be
>>>> > used. In the case of Python the PySpark library is used. When Python is
>>>> > used, the PySpark library can be used to define the RDDs that will be
>>>> > executed under the JVM. In this scenario, only if required, the final
>>>> > results of the calculations will end up being passed to Python. I say
>>>> > only if required as it's possible the end results may just be left in
>>>> > memory or be used to create an output such as an HDFS file in Hadoop,
>>>> > and so do not need to be transferred to Python. Under this scenario the
>>>> > code is written in Python but effectively all the "real" work is
>>>> > performed under the JVM.
>>>> >
>>>> > Often someone writing Python is also going to want to perform some of
>>>> > the operations under Python. This can be done, as the RDDs that are
>>>> > created can contain both operations that get performed under the JVM as
>>>> > well as Python (and of course other languages are supported). When
>>>> > Python is involved, Spark will start up Python VMs on the required nodes
>>>> > so that the Python portions of the work can be performed. The Python VMs
>>>> > can either be CPython, PyPy or even a mix of both CPython and PyPy. The
>>>> > downside to using non-Java languages is the overhead of passing data
>>>> > between the JVM and the Python VM, as the memory is not shared between
>>>> > the processes but instead copied/serialized between them.
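
The difference between copied/serialized data and shared memory described
above can be illustrated with a minimal sketch that uses only the Python
standard library. It does not involve Spark, Py4J or a JVM; the function
names are hypothetical, and pickle merely stands in for whatever wire format
a real bridge would use.

```python
import pickle


def serialized_round_trip(payload: bytes) -> bytes:
    """Mimic crossing a process boundary: serialize, then deserialize.

    The result is an independent copy, so the cost grows with the payload.
    """
    wire = pickle.dumps(payload, protocol=pickle.HIGHEST_PROTOCOL)
    return pickle.loads(wire)


def shared_view(buffer: bytearray) -> memoryview:
    """Expose the same underlying buffer without copying any bytes."""
    return memoryview(buffer)


data = bytearray(b"x" * 1_000_000)
copied = serialized_round_trip(bytes(data))  # a full, independent copy
view = shared_view(data)                     # no copy, same memory

# Mutate the original buffer after both objects have been created.
data[0] = ord("y")
```

After the mutation, `copied` still holds the old first byte while `view`
sees the new one, which is the distinction between the two models.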
>>>> >
>>>> > Because this data is copied between the 2 VMs, anyone who writes Python
>>>> > code for this environment always has to be conscious of the data being
>>>> > copied between the processes, so as to not let the amount of extra
>>>> > overhead become a large burden. Quite often the goal will be to first
>>>> > perform the bulk of the operations under the JVM, so that hopefully only
>>>> > a smaller subset of the data will have to be processed under Python. If
>>>> > this can be done then the overhead can be minimized and there are
>>>> > essentially no downsides to using Python in the pipeline of operations.
>>>> >
>>>> > If you're unfortunate and need to perform some of the processing early
>>>> > in the pipeline under Python, and worse yet if there is a need to go
>>>> > back and forth many times between Python and Java, the overhead of
>>>> > copying huge amounts of data can significantly slow things down, which
>>>> > essentially puts Python at a disadvantage to Java.
>>>> >
>>>> > If it was possible to change the model of execution such that the
>>>> > Python VM could be embedded in the JVM, or vice versa, and memory could
>>>> > be shared between the 2 VMs, the downside of using Python in this
>>>> > environment would be eliminated, or at the very least minimized to the
>>>> > point where it is no longer an issue. Thus the need for a jffi library.
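
As a rough illustration of the shared-memory model described above, the
following sketch uses Python's multiprocessing.shared_memory module
(Python 3.8+). It only shows two handles attached to one named block, as two
cooperating processes would be; it is not a JVM bridge, and jffi remains a
proposed library, not an existing one. The helper name is hypothetical.

```python
from multiprocessing import shared_memory


def write_then_read_shared(message: bytes) -> bytes:
    """Write through one handle and read through a second handle.

    The second handle attaches by name, the way another process would,
    so the bytes are never serialized or sent over a pipe or socket.
    """
    size = len(message)  # must be greater than zero
    owner = shared_memory.SharedMemory(create=True, size=size)
    try:
        owner.buf[:size] = message
        reader = shared_memory.SharedMemory(name=owner.name)
        try:
            # Release the slice before closing, as close() requires.
            with reader.buf[:size] as window:
                return bytes(window)
        finally:
            reader.close()
    finally:
        owner.close()
        owner.unlink()  # free the block once every handle is closed


result = write_then_read_shared(b"hello")
```

A real bridge would additionally need both VMs to agree on the layout of the
data in the block, which is the hard part this sketch leaves out.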
>>>> >
>>>> > There is a strong desire by many to use dynamic languages in these
>>>> > clustered environments, and Python is likely in the best position to
>>>> > become the language of choice due to its ability to work with C-based
>>>> > libraries and of course its syntax. The issues that hold Python back at
>>>> > this point are the serialization overhead, the not-so-great state of
>>>> > packaging, and not having both the speed of the JIT and complete access
>>>> > to the numpy/scipy ecosystem.
>>>> >
>>>> > Luckily for Python, at this point there is no other dynamic language
>>>> > that is a clear winner today. But if too much time passes before these
>>>> > issues are solved, I'm sure another language will step up to the plate.
>>>> > At this point my expectation is that Node could likely make a move. It
>>>> > already has the speed due to the JavaScript JITs, it already has a great
>>>> > story for packaging and deployment, and its growth is exploding on the
>>>> > server side due to all the money being poured into it. What it strongly
>>>> > lacks today is the connection to C/legacy code and numerical/scientific
>>>> > modules, and of course it has no solution to the data copying overhead
>>>> > that it, too, has with the JVM.
>>>> >
>>>> > Anyway, this is just my 2 cents on what is currently holding Python
>>>> > back from taking off in this space.
>>>> >
>>>> > On Thu, Mar 24, 2016 at 2:32 AM, Hakan Ardo <hakan.ardo at gmail.com>
>>>> > wrote:
>>>> >>
>>>> >>
>>>> >> On Mar 23, 2016 21:49, "Armin Rigo" <arigo at tunes.org> wrote:
>>>> >> >
>>>> >> > Hi John,
>>>> >> >
>>>> >> > On 23 March 2016 at 19:16, John Camara <john.m.camara at gmail.com>
>>>> >> > wrote:
>>>> >> > > I would like to suggest one more topic for the workshop. I see a
>>>> >> > > big need for a library (jffi) similar to cffi but that provides a
>>>> >> > > bridge to Java instead of C code. The ability to seamlessly work
>>>> >> > > with native Java data/code would offer a huge improvement (...)
>>>> >> >
>>>> >> > Isn't that what JPype does? Can you describe how it isn't suitable
>>>> >> > for your needs?
>>>> >>
>>>> >> There is also PyJNIus:
>>>> >>
>>>> >> https://pyjnius.readthedocs.org/en/latest/
>>>> >
>>>> >
>>>> >
>>>> > _______________________________________________
>>>> > pypy-dev mailing list
>>>> > pypy-dev at python.org
>>>> > https://mail.python.org/mailman/listinfo/pypy-dev
>>>> >
>>>
>>>