
On Tue, Aug 25, 2015 at 3:58 PM, Charles R Harris <charlesr.harris@gmail.com> wrote:
On Tue, Aug 25, 2015 at 1:00 PM, Travis Oliphant <travis@continuum.io> wrote:
Thanks for the write-up, Nathaniel. There is a lot of great detail and interesting ideas here.
<snip>
I think that summarizes my main concerns. I will write up more forward-thinking ideas for what else is possible in the coming weeks. In the meantime, thanks for keeping the discussion going. It is extremely exciting to see the help people have continued to provide to maintain and improve NumPy. It will be exciting to see what the next few years bring as well.
I think the only thing that looks even a little bit like a numpy 2.0 at this time is DyND. Rewriting numpy, let alone producing numpy 2.0, is a major project. DyND is 2.5+ years old, 3500+ commits in, and still in progress. If there is a decision to pursue DyND I could support that, but I think we would want to think deeply about how to make the transition as painless as possible. It would be good at this point to get some feedback from people currently using DyND. IIRC, part of the reason for starting DyND was the perception that it was not possible to evolve numpy without running into compatibility road blocks. Travis, could you perhaps summarize the thinking that went into the decision to make DyND a separate project?
I think it would be best if Mark Wiebe speaks up here. I can explain why Continuum supported DyND with some fraction of Mark's time for a few years and give my perspective, but ultimately DyND is Mark's story to tell (and a few talented people have now joined him in the effort).

Mark Wiebe was a productive NumPy developer. He was one of a few people that jumped in on the code-base and made substantial and significant changes, and he came to understand just how hard it can be to develop in the NumPy code-base. He is also a C++ developer who really likes the beauty and power of that language (which definitely biases his NumPy work, but he did put a lot of effort into making NumPy better). Before Peter and I started Continuum, Mark had begun the DyND project as an example of a general-purpose dynamic array library that could be used by any dynamic language to make arrays.

In the early days of Continuum, we spent time from at least Mark W, Bryan Van de Ven, Jay Bourque, and Francesc Alted looking at how to extend NumPy to add 1) categorical data-types, 2) variable-length strings, and 3) better date-time types. Bryan, a good developer who has gone on to be a primary developer of Bokeh, spent quite a bit of time and had a prototype of categoricals *nearly* working. He did not like working on the NumPy code-base "at all". He struggled with it and found it very difficult to extend. He worked closely with Mark Wiebe, who helped him the best he could. What took him 4 weeks in NumPy took him 3 days in DyND to build. I think that experience convinced both him and Mark W that working with the NumPy code-base would take too long to make significant progress.

Also, during 2012 I was trying to help with release-management (though I ended up just hiring Ondrej Certik to actually do the work, and he did a great job of getting a release of NumPy out the door --- thanks to much help from many of you). At that point, I realized very clearly that what I could best do was to try and get more resources for open source and for the NumPy stack rather than work on the code directly. We also did work with several clients that helped me realize just how many disruptive changes had happened from 1.4 to 1.7 for extensive users of NumPy (much more than would be justified from a "we don't break the ABI" mantra that was the stated goal). We also realized that the kind of experimentation we wanted to do in the first 2 years of Continuum would just not be possible on the NumPy code-base, and the need for getting community buy-in on every decision would slow us down too much --- as we had to iterate rapidly on so many things and find our center as a startup. It also would not be fair to the NumPy community.

Our decision to do *all* of our exploration outside the NumPy code-base came down to: 1) the kinds of changes we ultimately wanted were potentially dramatic and disruptive, 2) it would be too difficult and time-consuming to decide all things in public discussions with the NumPy community --- especially when some things were experimental, 3) tying ourselves to releases of NumPy would be difficult at that time, and 4) the design of the NumPy code-base makes it difficult to contribute to --- both Mark W and Bryan V felt they could make progress *much* faster in a new code-base.

Continuum did not have enough start-up funding to devote significant time to DyND in the early days. So Mark rallied what resources he could, we supported him the best we could, and he made progress.
My only real requirement with sponsoring his work when we did was that it must have a Python interface that did not use Boost. He stretched Cython and found a lot of holes in it, and that took a bit of his time as well. I think he is now a "just write your own wrapper" believer, but I shouldn't put words in his mouth or digress.

DyND became part of the Blaze effort once we received DARPA money (though the grant was primarily for Bokeh, we also received permission to use some of the funds for Numba and Blaze development). Because of the other work around Numba and Blaze, DyND work was delayed quite often. For the Blaze project, DyND mostly became another implementation of the data-shape data-description mechanism and a way to prototype computed columns and remote arrays (now in Blaze server).

The Blaze team struggled for the first 18 months with the lack of a gelled team and a concrete vision for what it should be exactly. Thanks to Andy Terrel, Phillip Cloud, Mark Wiebe, and Matt Rocklin, as well as others who are currently on the project, Blaze is now much more clear in its goals as a high-level array and table logical object for scientists, data-scientists, and engineers that can be backed by larger-than-memory (i.e. Dask) and cluster-based computational systems (i.e. Spark and Impala). This clarity was not present as we looked for people to collaborate with and explored the space of code-compilation, delayed evaluation, and data-type-systems that are necessary and useful for distributed array-systems generally. If you look today at Ibis and the Bolt project, you see other examples of what Blaze is. I see massive overlap between Blaze and these projects. I think the descriptions of those projects can help you understand Blaze, which is why I mention them.

In that confusion, Mark continued to make progress on his C++-based container-type (at one point we even called it "Blaze-local") that had the advantage of not requiring a Python runtime and could fully parse the data-shape data-description system that is a generalization of NumPy dtypes (a small example of this notation appears below). Some of this was on Continuum time, some on his own time. Last year, he attracted the attention of Irwin Zaid, who added GPU-computation capability. Last fall, Pandas was able to make DyND an optional dependency because DyND has better support for some of the key things Pandas needs and does not require the full NumPy API.

In January, Mark W left Continuum to go back to work in the digital-effects industry on his old code-base, though he continues to take interest in DyND. A month ago, Continuum began to again sponsor Irwin to work on DyND in order to continue its development at least sufficiently to support 1) Pandas and 2) processing of semi-structured data (like a collection of JSON objects). DyND is a bigger system than NumPy (as it doesn't rely on Python at all for its core functionality). The Python interface has not always been as up to date as it could be, and Irwin is currently working on that as well as making it easier to install. I'm sure he would love the help if anyone wants to join him.
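For anyone who has not seen the data-shape notation, here is a small illustration. This is only a sketch: it assumes the `datashape` Python package that grew out of Blaze is installed, and the exact API spelling may differ across versions.

    # Illustrative only: datashape generalizes NumPy dtypes with explicit
    # dimensions, variable-length ("var") dimensions, and richer scalar types.
    import datashape

    # A fixed 10-element array of records with a true variable-length string:
    ds = datashape.dshape("10 * {name: string, amount: float64}")
    print(ds)

    # A ragged dimension, which a NumPy dtype cannot express directly:
    ragged = datashape.dshape("var * int32")
    print(ragged)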
At the same time, in 2012 I became very enamored with Numba and the potential for how Numba could make it possible to not even *have* to depend on a single container library like NumPy. I often say that if Numba and Conda had existed 15 years ago, there would not even *be* a SciPy library. Instead there would be a collection of numba-modules that do all the same things. We might not even have Julia as well --- but that is a longer and more controversial conversation.

With Numba you can write your own array-code as needed. We moved the basic array-type into an llvm specification (llvm_array.py) in old llvmpy: https://github.com/llvmpy/llvmpy/blob/master/llvm_array/array.py (note that llvmpy is no longer maintained, though). At this point quite a bit of the NumPy API is implemented outside of NumPy in Numba (there is still much more to do, though).

As Numba has developed, I have seen how *both* DyND *and* Numba could independently be an architecture to underlie a new array abstraction that could effectively replace NumPy for people. A combination of the two would be quite powerful --- especially when combined now with Dask.

Numba needs 2 things presently before I can confidently say that a numpy module could be built that is fully backwards API-compatible with current NumPy in about 6 months (though not necessarily semantically in all corner cases). These 2 things are currently on the near-term Numba road-map: 1) the ability to ship a Python extension module that does not require numba to be installed, and 2) jit-classes (so that you can build native classes and have that be part of the type-specification).

So, basically you have 2 additional options for NumPy's future besides what Nathaniel laid out: 1) DyND-based or 2) Numba-based. A combination of the two (DyND as a pre-compiled run-time library and Numba for JIT extensions) is also possible.

A third approach has even more potential to super-charge Python 3.X for array-oriented programming. This approach could also be combined with DyND and/or Numba as desired. This approach uses the fact that the buffer protocol in Python exists, and therefore we *can* have more than one array-type. In fact, the basic array-structure exists as the memory-view object in Python (rescued from its unfinished form by Antoine and now supported in Cython). The main problems with it as an underlying array-type for computation are that 1) its type-system is a low-level struct-string syntax that is hard to build on, and 2) there are no basic computations on memory-views. These are both easily remedied.

So, the approach would be to: 1) build a Python-type-to-struct-string syntax translator that would allow you to create memory-views from a Python-based type-system that replaces dtype, and 2) make a new gufunc sub-system that works with memory-views as containers (a rough sketch of both pieces follows below). I think this would be an interesting project in its own right and could borrow from current NumPy a great deal --- I think it would be simpler than the re-factor of gufuncs that Nathaniel proposes to enable dtype-information to be available to the low-level multi-methods. You can basically eliminate NumPy with something that provides those 2 things --- and that is potentially something you could rally PyPy and Jython and any other Python implementation behind (rather than numpypy and/or numpy4j).

If anyone is interested in pursuing this last idea, please let me know. It hit me like a brick at PyCon this year after talking with Nathaniel about what he wanted to do with dtypes and watching Guido's talk on type-hinting now in Python 3.
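To make that concrete, here is a minimal sketch of the two pieces using only the standard library. The names (to_struct_string, map_elementwise) and the tiny type-name table are made up for illustration; a real version would handle records, shapes, broadcasting, and compiled inner loops.

    import array

    # 1) A toy "Python-type-to-struct-string" translator.  The mapping is
    #    hypothetical and minimal; a real translator would cover records,
    #    nested types, byte order, and shaped types.
    _STRUCT_CODES = {"float64": "d", "float32": "f", "int64": "q", "int32": "i"}

    def to_struct_string(typename):
        try:
            return _STRUCT_CODES[typename]
        except KeyError:
            raise TypeError("no struct-string code known for %r" % typename)

    # 2) A toy element-wise "ufunc" over 1-d memory-views: it checks that the
    #    buffer formats and lengths agree, then loops in pure Python (a real
    #    gufunc sub-system would dispatch to compiled loops keyed on format).
    def map_elementwise(func, mv_in, mv_out):
        if mv_in.format != mv_out.format or len(mv_in) != len(mv_out):
            raise ValueError("mismatched buffer format or length")
        for i in range(len(mv_in)):
            mv_out[i] = func(mv_in[i])

    data = array.array(to_struct_string("float64"), [1.0, 2.0, 3.0])
    out = array.array(to_struct_string("float64"), [0.0] * len(data))
    map_elementwise(lambda x: x * x, memoryview(data), memoryview(out))
    print(out.tolist())  # -> [1.0, 4.0, 9.0]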
Finally, as I've been thinking more and more about *big* data and the needs of scaling, I've toned down my infatuation with "typed pointers" (which NumPy basically is). The real value of "typed pointers" is that there is so much low-level code out there that does interesting things that use "typed pointers" for their basic shared abstraction.

However, what we really need shared abstractions around are "typed iterators" and a whole lot of code that uses these "typed iterators" for all kinds of calculations. The problem is that there is no C-ABI equivalent for typed iterators. Where is the BLAS or LAPACK for typed-iterators that doesn't rely on a particular C++ compiler to get the memory-layout? Every language stack implements iterators in its own way --- so you have silos and not shared abstractions across run-times. A NumPy-like stack built on typed-iterators is therefore a *whole lot* harder to build. This is part of why I want to see jit-classes in Numba --- I want to end up with a defined ABI for abstractions.

Abstractions are great. Shared abstractions can be *viral* and are exponentially better. We need more of those! My plea to anyone reading this is: please make more shared abstractions ;-) Of course no one person can make a *shared* abstraction --- they have to emerge! One person can make abstractions, though --- and that is the prerequisite to getting them adopted by others and therefore shared.

I know this is a dump of a lot of information. Some of it might even make sense, and perhaps a little bit might be useful to some of you.

Now for a blatant plea: if you are interested in working on NumPy (with ideas from whatever source --- not just mine), please talk to me --- we are hiring, and I can arrange for some of your time to be spent contributing to any of these ideas (including what Nathaniel wrote about --- as long as we plan for ABI breakage). Guido offered this for Python, and I will offer it for NumPy --- if you are a woman with the right background, I will personally commit to training you to be able to work more on NumPy. But be warned: working on NumPy is not the path to riches, and fame is fleeting ;-)

Best,

-Travis
<snip>
Chuck
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
--
*Travis Oliphant*
*Co-founder and CEO*
@teoliphant
512-222-5440
http://www.continuum.io