[Numpy-discussion] big-bangs versus incremental improvements (was: Re: SciPy 2014 BoF NumPy Participation)

Nathaniel Smith njs at pobox.com
Wed Jun 4 20:56:51 EDT 2014


On Wed, Jun 4, 2014 at 7:18 AM, Travis Oliphant <travis at continuum.io> wrote:
> Even relatively simple changes can have significant impact at this point.
> Nathaniel has laid out a fantastic list of great features.  These are the
> kind of features I have been eager to see as well.  This is why I have been
> working to fund and help explore these ideas in the Numba array object as
> well as in Blaze.    Gnumpy, Theano, Pandas, and other projects also have
> useful tales to tell regarding a potential NumPy 2.0.

I think this is somewhat missing the main point of my message :-). I
was specifically laying out a list of features that we could start
working on *right now*, *without* waiting for the mythical "numpy
2.0".

> Ultimately, I do think it is time to talk seriously about NumPy 2.0, and
> what it might look like.   I personally think it looks a lot more like a
> re-write, than a continuation of the modifications of Numeric that became
> NumPy 1.0.     Right out of the gate,  for example, I would make sure that
> NumPy 2.0 objects somehow used PyObject_VAR_HEAD so that they were
> variable-sized objects where the strides and dimension information was
> stored directly in the object structure itself instead of allocated
> separately (thus requiring additional loads and stores from memory).   This
> would be a relatively simple change.  But, it can't be done and preserve ABI
> compatibility.  It may also, at this point, have impact on Cython code, or
> other code that is deeply-aware of the NumPy code-structure.     Some of the
> changes that should be made will ultimately require a porting exercise for
> new code --- at which point why not just use a new project.

I'm not aware of any obstacles to packing strides/dimensions/data into
the ndarray object right now, tomorrow if you like -- we've even
discussed doing this recently in the tracker. PyObject_VAR_HEAD in
particular seems... irrelevant? It's just syntactic sugar for adding
an integer field called "ob_size" to a Python object struct, plus a
few macros for working with that field. We don't need or want
such a field anyway (for shape/strides it would be redundant with
ndim), and even if we did want such a field we could add it any time
without breaking ABI. And if someday we do discover some compelling
advantage to breaking ABI by rearranging the ndarray struct, then we
can do this with a bit of planning by using #ifdefs to make the
rearrangement coincide with a new Python release. E.g., people
building against Python 3.5 get the new struct layout, people building
against 3.4 get the old, and in a few years we drop support for the
old. No compatibility breaks needed, never mind rewrites.
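
To make that concrete, here's a rough sketch -- invented names and an
arbitrary version cutoff, not anything from the actual numpy headers --
of how an #ifdef could tie an inline shape/strides layout to a Python
release:

    /* Illustrative sketch only: SketchArrayObject and SKETCH_MAXDIMS
     * are made-up names, and the 3.5 cutoff is arbitrary. */
    #include <Python.h>

    #define SKETCH_MAXDIMS 32

    typedef struct {
        PyObject_HEAD            /* no PyObject_VAR_HEAD: an ob_size
                                    field would just duplicate nd */
        char *data;
        int nd;
    #if PY_VERSION_HEX >= 0x03050000
        /* hypothetical new ABI: shape and strides stored inline in the
           object, so no separate allocation or extra pointer chase */
        Py_ssize_t dimensions[SKETCH_MAXDIMS];
        Py_ssize_t strides[SKETCH_MAXDIMS];
    #else
        /* old ABI: shape and strides kept in separately allocated
           memory, as today */
        Py_ssize_t *dimensions;
        Py_ssize_t *strides;
    #endif
        /* ... descr, flags, base, etc. as before ... */
    } SketchArrayObject;

Extensions built against the 3.4 headers keep the old layout, extensions
built against 3.5 see the new one, and the rebuild that a new Python
release forces anyway absorbs the ABI change.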

More generally: I wouldn't rule out "numpy 2.0" entirely, but we need
to remember the immense costs that a rewrite-and-replace strategy will
incur. Writing a new library is very expensive, so that's one cost.
But that cost is nothing compared to the costs of getting that new
library to the same level of maturity that numpy has already reached.
And those costs, in turn, are absolutely dwarfed by the transition
costs of moving the whole ecosystem from one foundation to a
different, incompatible one. And probably even these costs are small
compared to the opportunity costs -- all the progress that *doesn't*
get made in the mean time because fragmented ecosystems suck and make
writing code hard, and the best hackers are busy porting code instead
of writing awesome new stuff. I'm sure dynd is great, but we have to
be realistic: the hard truth is that even if it's production-ready
today, that only brings us a fraction of a fraction of a percent
closer to making it a real replacement for numpy.

Consider the python 2 to python 3 transition: Python 3 itself was an
immense amount of work for a large number of people, with intense
community scrutiny of the design. It came out in 2008. Six years and
many, many improvements later, it's maybe sort-of starting to look like
a plurality of users might start transitioning soonish? It'll be years
yet before portable libraries can start taking advantage of python 3's
new awesomeness. And in the mean time, the progress of the whole
Python ecosystem has been seriously disrupted: think of how much
awesome stuff we'd have if all the time that's been spent porting and
testing different packages had been spent on moving them forward
instead. We also have experience closer to home -- did anyone enjoy
the numeric/numarray->numpy transition so much they want to do it
again? And numpy will be much harder to replace than numeric --
numeric wasn't the most-imported package in the pythonverse ;-). And
my biggest worry is that if anyone even tries to convince everyone to
make this kind of transition, and is at all successful, then they'll
create a substantial period where the ecosystem is a big incompatible
mess (and they might still eventually fail, providing no long-term
benefit to make up for the immediate costs). This scenario
is a nightmare for end-users all around.

By comparison, if we improve numpy incrementally, then we can in most
cases preserve compatibility totally, and in the rare cases where it's
necessary to break something we can do it mindfully, minimally, and
with a managed transition. (Downstream packages are already used to
handling a few limited API changes at a time; it's not that hard to
support both APIs during the transition period -- see the sketch below;
and so on. This way we bring the ecosystem with us.) Every
incremental improvement to numpy
immediately benefits its immense user base, and gets feedback and
testing from that immense user base. And if we incrementally improve
interoperability between numpy and other libraries like dynd, then
instead of creating fragmentation, it will let downstream packages use
both in a complementary way, switching back and forth depending on
which provides more utility on a case-by-case basis. If this means
that numpy eventually withers away because users vote with their feet,
then great, that'd be compelling evidence that whatever they were
migrating to really is better, which I trust a lot more than any
guesses we make on a mailing list. The gradual approach does require
that we be grown-ups and hold our noses while refactoring out legacy
spaghetti and writing unaesthetic compatibility hacks. But if you
compare this to the alternative... the benefits of incrementalism are,
IMO, overwhelming.
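
To give one concrete example of the "support both APIs" point above:
when numpy 1.7 added the NPY_ARRAY_* spellings of the array-flag macros
(NPY_ARRAY_C_CONTIGUOUS alongside the older NPY_C_CONTIGUOUS, and so
on), a downstream C extension that wanted to build against old and new
numpy alike only needed to carry a tiny shim -- roughly:

    /* Compatibility shim a downstream extension might carry during a
     * rename transition (a sketch; the function below is made up). */
    #include <Python.h>
    #include <numpy/arrayobject.h>

    #ifndef NPY_ARRAY_C_CONTIGUOUS
    #define NPY_ARRAY_C_CONTIGUOUS NPY_C_CONTIGUOUS  /* pre-1.7 headers */
    #endif

    /* ...and then use the new name everywhere: */
    static int
    is_c_contiguous(PyArrayObject *arr)
    {
        return (PyArray_FLAGS(arr) & NPY_ARRAY_C_CONTIGUOUS) != 0;
    }

Mildly annoying, but it's a small, local, one-time cost per package --
exactly the kind of cost the ecosystem already knows how to pay.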

The only exception is when two specific criteria are met: (1) there
are changes that are absolutely necessary for the ecosystem's
long-term health (e.g., py3's unicode-for-mere-mortals and true
division), AND (2) it's absolutely impossible to make these changes
incrementally (unicode and true division first entered Python in 2000
and 2001, respectively, and immense effort went into finding the
smoothest transition, so it's pretty clear that as painful as py3 has
been, there isn't really anything better).

What features could meet these two criteria in numpy's case? If I were
the numpy ecosystem and you tried to convince me to suffer through a
big-bang transition for the sake of PyObject_VAR_HEAD, then I think
I'd be kinda unconvinced. And it only took me a few minutes to rattle off
a whole list of incremental changes that haven't even been tried yet.

-n

-- 
Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org


