[IPython-dev] Fwd: [julia-dev] Re: Machine learning

Fernando Perez fperez.net at gmail.com
Wed Jun 24 19:10:56 EDT 2015


Hi folks

I know we have a number of devs on this list interested in improving the
integration between IPython/Jupyter and PySpark, and today, in a thread on
the Julia-dev mailing list, I found this particularly clear summary of how
PySpark executes code and interacts with the JVM at a low level. I'd heard
the story a couple of times, but had never seen it so concisely summarized,
with pointers to the specific locations in the code where the process
boundaries are crossed. Very useful.

Cheers

f




---------- Forwarded message ----------
From: Andrei Zhabinski <faithlessfriend at gmail.com>
Date: Sat, Jun 20, 2015 at 9:22 AM
Subject: Re: [julia-dev] Re: Machine learning
To: julia-dev at googlegroups.com


I feel it's worth describing how PySpark is implemented and what is needed
to connect Julia to Spark in the same manner.

In Spark, the central concept is the RDD - a distributed collection of data,
organized into partitions (splits). There are many different types of RDDs,
such as MapPartitionsRDD, ShuffledRDD, CheckpointRDD, etc. Each type of RDD
introduces a conceptually new feature, e.g. MapPartitionsRDD is used to
implement `map()`, `flatMap()`, `mapPartitions()` and similar methods, while
ShuffledRDD is responsible for shuffling data between machines, etc.

To add a new feature, every type of RDD must implement at least one
method - `compute(split: Partition, context: TaskContext): Iterator[T]`.
Essentially, `compute()` takes an iterator over the input partition's data
and returns an iterator over the output data. This is very similar to how
`mapPartitions()` works - it simply applies some arbitrary transformation to
every partition (and this is essentially what MapPartitionsRDD's `compute()`
method does [1]). Some RDDs also involve data shuffling and access to
external resources, but they aren't that important for this discussion.
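
To make that contract concrete, here is a minimal Python sketch of the same
idea (the names below are mine, not Spark's actual code): an RDD implements
`compute()` by turning the parent partition's iterator into a new iterator.

    # Toy sketch of the compute() contract described above: a per-partition
    # function receives the whole partition iterator and returns a new one,
    # which is essentially what MapPartitionsRDD.compute() does in Scala.

    def map_partitions_compute(parent_iter, f):
        # f gets the entire partition iterator and returns a new iterator
        return f(parent_iter)

    # Implementing map() in terms of a per-partition transformation:
    square_partition = lambda records: (x * x for x in records)

    partition = iter([1, 2, 3, 4])
    print(list(map_partitions_compute(partition, square_partition)))  # [1, 4, 9, 16]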

This is where PythonRDD comes in. In its `compute()` method, PythonRDD:

1) creates or reuses a Python worker process [2]
2) writes the serialized command and input data to the Python process (in a
separate thread) [3]
3) reads the results back from the Python process [4]

So the Scala side talks to the Python process over a socket interface, using
a simple custom protocol.
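
Just to illustrate the shape of that round trip, here is a toy sketch - not
PySpark's real wire protocol or worker code; the framing and port below are
made up:

    # Toy worker loop: read a pickled function and one partition of data from
    # a socket, apply the function, and stream the results back. The real
    # protocol is more involved (length-prefixed frames, end-of-stream
    # markers, accumulator updates, etc.).

    import pickle
    import socketserver

    class ToyWorker(socketserver.StreamRequestHandler):
        def handle(self):
            func = pickle.load(self.rfile)        # the serialized "command"
            records = pickle.load(self.rfile)     # one partition's input data
            results = [func(x) for x in records]  # apply it record by record
            pickle.dump(results, self.wfile)      # send results back to the JVM

    if __name__ == "__main__":
        # In real Spark the JVM launches the worker; here we just listen on an
        # arbitrary port for demonstration.
        socketserver.TCPServer(("localhost", 9999), ToyWorker).serve_forever()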

But essentially we want Python to talk to the JVM, not the other way around.
This is where Py4J comes in. The PySpark driver launches a JVM and uses it to
maintain all the needed objects (mainly SparkContext and RDDs). Python's RDD
(i.e. "class RDD" in "rdd.py") keeps a reference to the corresponding JVM
object ("PythonRDD" in "PythonRDD.scala") and calls its methods. When we
write something like this in PySpark:

    rdd = sc.textFile(...)
    rdd.map(lambda x: x**2).collect()

what happens is actually this:

1) a Python RDD is created, pointing to a PythonRDD object in the JVM
2) a subclass of Python's RDD - PipelinedRDD - is created; PipelinedRDD keeps
a reference to the previous RDD and the function `f = lambda x: x**2` to be
applied to each record of the original RDD
3) `PipelinedRDD.collect()` calls the corresponding method on PythonRDD,
which passes all the data through sockets to the Python processes on the
workers, collects the results, and passes them back to the Python process on
the driver machine.
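
The driver-side pattern behind steps 1-3 can be sketched roughly like this
(illustrative names only, not the actual contents of rdd.py; a plain Python
object stands in here for the Py4J proxy):

    # The frontend RDD is a thin wrapper around a handle to a JVM object:
    # it forwards actions to the JVM side and gets the results back.

    class FakeJvmRDD:
        """Stand-in for the PythonRDD object that would live in the JVM."""
        def __init__(self, data):
            self._data = data

        def collect(self):
            # In real Spark this runs the job on the cluster and streams the
            # results back to the driver over a socket.
            return list(self._data)

    class ToyRDD:
        def __init__(self, jvm_rdd):
            self._jrdd = jvm_rdd   # reference to the JVM-side object

        def collect(self):
            return self._jrdd.collect()  # delegate the action to the JVM side

    print(ToyRDD(FakeJvmRDD([1, 2, 3])).collect())  # [1, 2, 3]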

PipelinedRDD is so called because it can pipeline Python functions across
`map()` and `reduce()` operations without going back to the JVM in between,
but this is mostly an optimization.
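
The pipelining itself can be sketched like this (again, illustrative names
and a simplified scheme, not the real PipelinedRDD):

    # When maps are chained, each new stage fuses its function with the
    # previous stage's function, so only one composed function ever has to be
    # serialized and shipped across the JVM/worker boundary.

    class ToyPipelinedRDD:
        def __init__(self, prev, func):
            if isinstance(prev, ToyPipelinedRDD):
                prev_func = prev.func
                self.func = lambda records: func(prev_func(records))  # fuse stages
                self.prev = prev.prev
            else:
                self.func = func
                self.prev = prev

        def map(self, g):
            # Record a per-partition version of g; nothing is executed yet.
            return ToyPipelinedRDD(self, lambda records: (g(x) for x in records))

    # rdd.map(f).map(g) -> ONE ToyPipelinedRDD whose func applies f, then g.
    base = ToyPipelinedRDD(None, lambda records: iter(records))
    fused = base.map(lambda x: x ** 2).map(lambda x: x + 1)
    print(list(fused.func([1, 2, 3])))  # [2, 5, 10]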

Some points are simplified or may contain errors, but this is more or less
how it works.

So what do we need to implement a Julia-Spark connector? Essentially, only
two things - a Julia-aware worker (a JuliaRDD) and the driver-side
types/functions (RDD, PipelinedRDD) to call it directly from Julia. Not that
much!



[1]: https://github.com/apache/spark/blob/branch-1.4/core/src/main/scala/org/apache/spark/rdd/MapPartitionsRDD.scala#L34
[2]: https://github.com/apache/spark/blob/branch-1.4/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L73
[3]: https://github.com/apache/spark/blob/branch-1.4/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L208
[4]: https://github.com/apache/spark/blob/branch-1.4/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L106


On Friday, June 19, 2015 at 11:16:20 PM UTC+3, Jeff Waller wrote:
>
>
>
> On Friday, June 19, 2015 at 1:26:55 PM UTC-4, wil... at gmail.com wrote:
>>
>> A Spark integration requires a Java-Julia (de)serializer, like the ones
>> used for Python and R. The Python frontend is based on Py4J, and a similar
>> RPC mechanism is used for SparkR. Because Spark is essentially a bunch of
>> transformations (read: functions), these transformations need to be mapped
>> to function calls in the frontend language. IMHO, a Spark-Julia binding
>> deserves a try, but I'm more inclined towards a pure Julia implementation
>> of the Spark transformations. A native Julia Spark package on top of
>> Elly.jl would beat any JVM-based implementation in performance and
>> resource usage.
>>
>
> I believe that getting something that actually runs will inspire people to
> try it and make it better, and it (part 1) can be completed
> before the end of the summer.
>
> Let me see if I can understand this problem.  Are you saying it's this, for
> example:
>
> PySpark
>
> someRDD.reduceByKey(lambda x,y: x+y)    ---> mapped to Java via Py4J
>
> JuliaSpark
>
> reduceByKey(someRDD, (x,y)->x+y) ---->  mapped to Java via X   <--- what
> does this need to be?
>
> and is the tricky part coming up with a reasonable X?
>
>
>


-- 
Fernando Perez (@fperez_org; http://fperez.org)
fperez.net-at-gmail: mailing lists only (I ignore this when swamped!)
fernando.perez-at-berkeley: contact me here for any direct mail