
On Tue, Aug 25, 2015 at 3:58 PM, Charles R Harris <charlesr.harris@gmail.com> wrote:
On Tue, Aug 25, 2015 at 1:00 PM, Travis Oliphant <travis@continuum.io> wrote:
Thanks for the write-up, Nathaniel. There is a lot of great detail and interesting ideas here.
<snip>
I think that summarizes my main concerns. I will write up more forward-thinking ideas for what else is possible in the coming weeks. In the meantime, thanks for keeping the discussion going. It is extremely exciting to see the help people have continued to provide to maintain and improve NumPy. It will be exciting to see what the next few years bring as well.
I think the only thing that looks even a little bit like a numpy 2.0 at this time is DyND. Rewriting numpy, let alone producing numpy 2.0, is a major project. DyND is 2.5+ years old, 3500+ commits in, and still in progress. If there is a decision to pursue DyND I could support that, but I think we would want to think deeply about how to make the transition as painless as possible. It would be good at this point to get some feedback from people currently using DyND. IIRC, part of the reason for starting DyND was the perception that it was not possible to evolve numpy without running into compatibility road blocks. Travis, could you perhaps summarize the thinking that went into the decision to make DyND a separate project?
I think it would be best if Mark Wiebe speaks up here. I can explain why Continuum supported DyND with some fraction of Mark's time for a few years and give my perspective, but ultimately DyND is Mark's story to tell (and a few talented people have now joined him in the effort).

Mark Wiebe was a productive NumPy developer. He was one of a few people that jumped in on the code-base and made substantial and significant changes, and he came to understand just how hard it can be to develop in the NumPy code-base. He is also a C++ developer who really likes the beauty and power of that language (which definitely biases his NumPy work, but he did put a lot of effort into making NumPy better). Before Peter and I started Continuum, Mark had begun the DyND project as an example of a general-purpose dynamic array library that could be used by any dynamic language to make arrays.

In the early days of Continuum, we spent time from at least Mark W, Bryan Van de Ven, Jay Bourque, and Francesc Alted looking at how to extend NumPy to add 1) categorical data-types, 2) variable-length strings, and 3) better date-time types. Bryan, a good developer who has gone on to be a primary developer of Bokeh, spent quite a bit of time and had a prototype of categoricals *nearly* working. He did not like working on the NumPy code-base "at all". He struggled with it and found it very difficult to extend. He worked closely with Mark Wiebe, who helped him the best he could. What took him 4 weeks in NumPy took him 3 days in DyND to build. I think that experience convinced both him and Mark W that working with the NumPy code-base would take too long to make significant progress.

Also, during 2012 I was trying to help with release-management (though I ended up just hiring Ondrej Certik to actually do the work, and he did a great job of getting a release of NumPy out the door --- thanks to much help from many of you). At that point, I realized very clearly that what I could best do was to try and get more resources for open source and for the NumPy stack rather than work on the code directly. We also did work with several clients that helped me realize just how many disruptive changes had happened from 1.4 to 1.7 for extensive users of NumPy (much more than would be justified from a "we don't break the ABI" mantra that was the stated goal). We also realized that the kind of experimentation we wanted to do in the first 2 years of Continuum would just not be possible on the NumPy code-base, and the need for getting community buy-in on every decision would slow us down too much --- as we had to iterate rapidly on so many things and find our center as a startup. It also would not be fair to the NumPy community.

Our decision to do *all* of our exploration outside the NumPy code-base came down to: 1) the kinds of changes we ultimately wanted were potentially dramatic and disruptive, 2) it would be too difficult and time-consuming to decide all things in public discussions with the NumPy community --- especially when some things were experimental, 3) tying ourselves to releases of NumPy would be difficult at that time, and 4) the design of the NumPy code-base makes it difficult to contribute to --- both Mark W and Bryan V felt they could make progress *much* faster in a new code-base.

Continuum did not have enough start-up funding to devote significant time to DyND in the early days. So Mark rallied what resources he could, we supported him the best we could, and he made progress.
My only real requirement with sponsoring his work when we did was that it must have a Python interface that did not use Boost. He stretched Cython and found a lot of holes in it, and that took a bit of his time as well. I think he is now a "just write your own wrapper" believer, but I shouldn't put words in his mouth or digress.

DyND became part of the Blaze effort once we received DARPA money (though the grant was primarily for Bokeh, we also received permission to use some of the funds for Numba and Blaze development). Because of the other work around Numba and Blaze, DyND work was delayed quite often. For the Blaze project, DyND mostly became another implementation of the data-shape data-description mechanism and a way to prototype computed columns and remote arrays (now in Blaze server).

The Blaze team struggled for the first 18 months with the lack of a gelled team and a concrete vision for what it should be exactly. Thanks to Andy Terrel, Phillip Cloud, Mark Wiebe, and Matt Rocklin, as well as others who are currently on the project, Blaze is now much more clear in its goals as a high-level array and table logical object for scientists, data-scientists, and engineers that can be backed by larger-than-memory (i.e. Dask) and cluster-based computational systems (i.e. Spark and Impala). This clarity was not present as we looked for people to collaborate with and explored the space of code-compilation, delayed evaluation, and data-type-systems that are necessary and useful for distributed array-systems generally. If you look today at Ibis and the Bolt project, you see other examples of what Blaze is. I see massive overlap between Blaze and these projects. I think the descriptions of those projects can help you understand Blaze, which is why I mention them.

In that confusion, Mark continued to make progress on his C++-based container-type (at one point we even called it "Blaze-local") that had the advantage of not requiring a Python runtime and could fully parse the data-shape data-description system that is a generalization of NumPy dtypes (a small example of this notation appears below). Some of this was on Continuum time, some on his own time. Last year, he attracted the attention of Irwin Zaid, who added GPU-computation capability. Last fall, Pandas was able to make DyND an optional dependency because DyND has better support for some of the key things Pandas needs and does not require the full NumPy API.

In January, Mark W left Continuum to go back to work in the digital-effects industry on his old code-base, though he continues to take interest in DyND. A month ago, Continuum began to again sponsor Irwin to work on DyND in order to continue its development at least sufficiently to support 1) Pandas and 2) processing of semi-structured data (like a collection of JSON objects). DyND is a bigger system than NumPy (as it doesn't rely on Python at all for its core functionality). The Python interface has not always been as up to date as it could be, and Irwin is currently working on that as well as making it easier to install. I'm sure he would love the help if anyone wants to join him.
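For anyone who has not seen the data-shape notation, here is a small illustration. This is only a sketch: it assumes the `datashape` Python package that grew out of Blaze is installed, and the exact API spelling may differ across versions.

    # Illustrative only: datashape generalizes NumPy dtypes with explicit
    # dimensions, variable-length ("var") dimensions, and richer scalar types.
    import datashape

    # A fixed 10-element array of records with a true variable-length string:
    ds = datashape.dshape("10 * {name: string, amount: float64}")
    print(ds)

    # A ragged dimension, which a NumPy dtype cannot express directly:
    ragged = datashape.dshape("var * int32")
    print(ragged)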
At the same time, in 2012 I became very enamored with Numba and the potential for how Numba could make it possible to not even *have* to depend on a single container library like NumPy. I often say that if Numba and Conda had existed 15 years ago, there would not even *be* a SciPy library. Instead there would be a collection of numba-modules that do all the same things. We might not even have Julia as well --- but that is a longer and more controversial conversation.

With Numba you can write your own array-code as needed. We moved the basic array-type into an llvm specification (llvm_array.py) in old llvmpy: https://github.com/llvmpy/llvmpy/blob/master/llvm_array/array.py (note that llvmpy is no longer maintained, though). At this point quite a bit of the NumPy API is implemented outside of NumPy in Numba (there is still much more to do, though).

As Numba has developed, I have seen how *both* DyND *and* Numba could independently be an architecture to underlie a new array abstraction that could effectively replace NumPy for people. A combination of the two would be quite powerful --- especially when combined now with Dask.

Numba needs 2 things presently before I can confidently say that a numpy module could be built that is fully backwards API-compatible with current NumPy in about 6 months (though not necessarily semantically in all corner cases). These 2 things are currently on the near-term Numba road-map: 1) the ability to ship a Python extension module that does not require numba to be installed, and 2) jit-classes (so that you can build native classes and have that be part of the type-specification).

So, basically you have 2 additional options for NumPy's future besides what Nathaniel laid out: 1) DyND-based or 2) Numba-based. A combination of the two (DyND as a pre-compiled run-time library and Numba for JIT extensions) is also possible.

A third approach has even more potential to super-charge Python 3.X for array-oriented programming. This approach could also be combined with DyND and/or Numba as desired. This approach uses the fact that the buffer protocol in Python exists, and therefore we *can* have more than one array-type. In fact, the basic array-structure exists as the memory-view object in Python (rescued from its unfinished form by Antoine and now supported in Cython). The main problems with it as an underlying array-type for computation are that 1) its type-system is a low-level struct-string syntax that is hard to build on, and 2) there are no basic computations on memory-views. These are both easily remedied.

So, the approach would be to: 1) build a Python-type-to-struct-string syntax translator that would allow you to create memory-views from a Python-based type-system that replaces dtype, and 2) make a new gufunc sub-system that works with memory-views as containers (a rough sketch of both pieces follows below). I think this would be an interesting project in its own right and could borrow from current NumPy a great deal --- I think it would be simpler than the re-factor of gufuncs that Nathaniel proposes to enable dtype-information to be available to the low-level multi-methods. You can basically eliminate NumPy with something that provides those 2 things --- and that is potentially something you could rally PyPy and Jython and any other Python implementation behind (rather than numpypy and/or numpy4j).

If anyone is interested in pursuing this last idea, please let me know. It hit me like a brick at PyCon this year after talking with Nathaniel about what he wanted to do with dtypes and watching Guido's talk on type-hinting now in Python 3.
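To make that concrete, here is a minimal sketch of the two pieces using only the standard library. The names (to_struct_string, map_elementwise) and the tiny type-name table are made up for illustration; a real version would handle records, shapes, broadcasting, and compiled inner loops.

    import array

    # 1) A toy "Python-type-to-struct-string" translator.  The mapping is
    #    hypothetical and minimal; a real translator would cover records,
    #    nested types, byte order, and shaped types.
    _STRUCT_CODES = {"float64": "d", "float32": "f", "int64": "q", "int32": "i"}

    def to_struct_string(typename):
        try:
            return _STRUCT_CODES[typename]
        except KeyError:
            raise TypeError("no struct-string code known for %r" % typename)

    # 2) A toy element-wise "ufunc" over 1-d memory-views: it checks that the
    #    buffer formats and lengths agree, then loops in pure Python (a real
    #    gufunc sub-system would dispatch to compiled loops keyed on format).
    def map_elementwise(func, mv_in, mv_out):
        if mv_in.format != mv_out.format or len(mv_in) != len(mv_out):
            raise ValueError("mismatched buffer format or length")
        for i in range(len(mv_in)):
            mv_out[i] = func(mv_in[i])

    data = array.array(to_struct_string("float64"), [1.0, 2.0, 3.0])
    out = array.array(to_struct_string("float64"), [0.0] * len(data))
    map_elementwise(lambda x: x * x, memoryview(data), memoryview(out))
    print(out.tolist())  # -> [1.0, 4.0, 9.0]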
Finally, as I've been thinking more and more about *big* data and the needs of scaling, I've toned down my infatuation with "typed pointers" (which NumPy basically is). The real value of "typed pointers" is that there is so much low-level code out there that does interesting things that use "typed pointers" for their basic shared abstraction.

However, what we really need shared abstractions around are "typed iterators" and a whole lot of code that uses these "typed iterators" for all kinds of calculations. The problem is that there is no C-ABI equivalent for typed iterators. Where is the BLAS or LAPACK for typed-iterators that doesn't rely on a particular C++ compiler to get the memory-layout? Every language stack implements iterators in its own way --- so you have silos and not shared abstractions across run-times. A NumPy-like stack built on typed-iterators is therefore a *whole lot* harder to build. This is part of why I want to see jit-classes in Numba --- I want to end up with a defined ABI for abstractions.

Abstractions are great. Shared abstractions can be *viral* and are exponentially better. We need more of those! My plea to anyone reading this is: please make more shared abstractions ;-) Of course no one person can make a *shared* abstraction --- they have to emerge! One person can make abstractions, though --- and that is the prerequisite to getting them adopted by others and therefore shared.

I know this is a dump of a lot of information. Some of it might even make sense, and perhaps a little bit might be useful to some of you.

Now for a blatant plea: if you are interested in working on NumPy (with ideas from whatever source --- not just mine), please talk to me --- we are hiring, and I can arrange for some of your time to be spent contributing to any of these ideas (including what Nathaniel wrote about --- as long as we plan for ABI breakage). Guido offered this for Python, and I will offer it for NumPy --- if you are a woman with the right background, I will personally commit to training you to be able to work more on NumPy. But be warned: working on NumPy is not the path to riches, and fame is fleeting ;-)

Best,

-Travis
<snip>
Chuck
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
--
*Travis Oliphant*
*Co-founder and CEO*
@teoliphant
512-222-5440
http://www.continuum.io