[Numpy-discussion] Notes from the numpy dev meeting at scipy 2015

Travis Oliphant travis at continuum.io
Wed Aug 26 10:06:14 EDT 2015


On Wed, Aug 26, 2015 at 1:41 AM, Nathaniel Smith <njs at pobox.com> wrote:

> Hi Travis,
>
> Thanks for taking the time to write up your thoughts!
>
> I have many thoughts in return, but I will try to restrict myself to two
> main ones :-).
>
> 1) On the question of whether work should be directed towards improving
> NumPy-as-it-is or instead towards a compatibility-breaking replacement:
> There's plenty of room for debate about whether it's better engineering
> practice to try and evolve an existing system in place versus starting
> over, and I guess we have some fundamental disagreements there, but I
> actually think this debate is a distraction -- we can agree to disagree,
> because in fact we have to try both.
>

Yes, on this we agree.   I think NumPy can improve *and* we can have new
innovative array objects.   I don't disagree about that.


>
> At a practical level: NumPy *is* going to continue to evolve, because it
> has users and people interested in evolving it; similarly, dynd and other
> alternative libraries will also continue to evolve, because they also have
> people interested in doing it. And at a normative level, this is a good
> thing! If NumPy and dynd both get better, then that's awesome: the worst
> case is that NumPy adds the new features that we talked about at the
> meeting, and dynd simultaneously becomes so awesome that everyone wants to
> switch to it, and the result of this would be... that those NumPy features
> are exactly the ones that will make the transition to dynd easier. Or if
> some part of that plan goes wrong, then well, NumPy will still be there as
> a fallback, and in the meantime we've actually fixed the major pain points
> our users are begging us to fix.
>
> You seem to be urging us all to make a double-or-nothing wager that your
> extremely ambitious plans will all work out, with the entire numerical
> Python ecosystem as the stakes. I think this ambition is awesome, but maybe
> it'd be wise to hedge our bets a bit?
>

You are mis-characterizing my view.  I think NumPy can evolve (though I
would personally rather see a bigger change to the underlying system like I
outlined before).    But, I don't believe it can even evolve easily in the
direction needed without breaking ABI and that insisting on not breaking it
or even putting too much effort into not breaking it will continue to
create less-optimal solutions that are harder to maintain and do not take
advantage of knowledge this community now has.

I'm also very concerned that 'evolving' NumPy will create a situation where
there are regular semantic and subtle API changes that will cause NumPy to
be less stable for its user-base.    I've watched this happen.   This comes
at a time when people are already looking around for new and different
approaches anyway.


>
> 2) You really emphasize this idea of an ABI-breaking (but not
> API-breaking) release, and I think this must indicate some basic gap in how
> we're looking at things. Where I'm getting stuck here is that... I actually
> can't think of anything important that we can't do now, but could if we
> were allowed to break ABI compatibility. The kinds of things that break ABI
> but keep API are like... rearranging what order the fields in a struct fall
> in, or changing the numeric value of opaque constants like
> NPY_ARRAY_WRITEABLE. The biggest win I can think of is that we could save a
> few bytes per array by arranging the fields inside the ndarray struct more
> optimally, but that's hardly a feature to hang a 2.0 on. You seem to have a
> vision of this ABI-breaking release as being something very different from
> that, and I'm not clear on what this vision is.
>
>
We already broke the ABI with date-time changes --- it's still broken for a
certain percentage of users last I checked.    So, part of my disagreement
is that we've tried this and it didn't work --- even though smart people
thought it would.    I've had to deal with this personally and I'm not
enthusiastic about having to deal with this for the next 5 years because of
even more attempts to make changes while not breaking the ABI.    I think
the group is more careful now --- but I still think the API is broad enough
and uses of NumPy deep enough that trying not to
break the ABI is just not worth the effort (because it's a non-feature
today).    Adding new dtypes without breaking the ABI is tricky (and to do
it without breaking the ABI is ugly).       I also continue to believe that
putting out a new ABI-breaking NumPy will allow re-compiling *once* (with
some porting changes needed) and not subtle breakages requiring
code-changes every time a release is made.    If subtle changes aren't
made, then the new features won't come.   Right now, I'd rather have
stability from NumPy than new features.   New features can come from other
libraries.

One specific change that could easily be made in NumPy 2.0 (the current
code but with an ABI change) is that Dtypes should become true type objects
and array-scalars (which are the current type-objects) should become
instances of those dtypes. That is the biggest clean-up needed, I think, on
the array-front.   There should not be *both* array-scalars and dtype
objects.    They are the same thing fundamentally.    It was a mistake to
have both of them.  I don't see how to make that change without breaking
the ABI.     Perhaps it could be done in a creative way --- but why put the
effort into that and end up with an even more hacky code-base?
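
A rough pure-Python illustration of that idea follows. It is only a sketch,
not NumPy code: the `DType` metaclass and its `format` / `itemsize`
attributes are names invented for this example. Each dtype is a real type
object, and a scalar is simply an instance of its dtype.

```python
import struct


class DType(type):
    """Hypothetical metaclass: every dtype it creates is a real Python type."""

    def __new__(mcls, name, fmt):
        def __init__(self, value):
            # Round-trip through struct so the scalar holds a value of this dtype.
            (self.value,) = struct.unpack(fmt, struct.pack(fmt, value))

        def __repr__(self):
            return f"{type(self).__name__}({self.value!r})"

        namespace = {
            "__init__": __init__,
            "__repr__": __repr__,
            "format": fmt,                      # struct-style type code
            "itemsize": struct.calcsize(fmt),   # bytes per element
        }
        return super().__new__(mcls, name, (), namespace)

    def __init__(cls, name, fmt):
        super().__init__(name, (), {})


# Two made-up dtypes: each one is a class whose metaclass is DType.
float64 = DType("float64", "d")
int32 = DType("int32", "i")

x = float64(1.5)                    # a scalar is an instance of its dtype
print(isinstance(float64, DType))   # the dtype is a type object
print(isinstance(x, float64))       # the scalar is an instance of it
print(float64.itemsize, int32.itemsize)
print(x)
```

In this arrangement there is one object per element type rather than a
dtype object plus a separate array-scalar type.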

NumPy's ABI was influenced by and evolved from Numeric and Numarray.  It
was not "designed" to last 30 years.

I think the dtype "types" should potentially have different
member-structures.       The ufunc sub-system needs an overhaul --- its
member structures need upgrades.   With generalized ufuncs and the
iteration protocols of Mark Wiebe we know a whole lot more about ufuncs
now.   Ufuncs are the same 1995 structure that Jim Hugunin wrote.   I
suppose you *could* just tack new functions on the end of the structure and
keep growing the list (while leaving old, unused structures as unused or
deprecated) --- or you can take the opportunity to tidy up a bit.   The
longer you leave everything the same, the harder you make the code-base and
the more costly maintenance becomes.    I just don't see the value there
--- and I see a lot of pain.

Regarding the ufunc subsystem: we've argued before about the lack of
multi-methods in NumPy.    Continuing to add dunder-methods to try and get
around it will continue to make the system harder to maintain and more
brittle.

You mention making NumPy an interface to multiple things along with many
other ideas.   I don't believe you can get there without real changes that
break things (at the very least semantic changes).   I'm not excited about
those changes causing instability (which they will cause --- to me the
burden of proof that they won't is on you, who want to make the change, and
not on me to say how they will).       I also think it will take much
longer to get there incrementally (if at all) than just creating something
on top of newer ideas.



> The main reason I personally am against having a big ABI-breaking release
> is not that I hate ABI breakage a priori, it's that all the big features
> that I care about and that users are asking for seem to be ones that...
> don't actually require doing that. At most they seem to get a mild benefit
> from breaking some obscure corner cases. So the cost/benefits don't make
> any sense to me.
>
> So: can you give a concrete example of a change you have in mind where
> breaking ABI would be the key enabler?
>
> (I guess you might also be thinking of a separate issue that you sort of
> allude to: Perhaps we will try to make changes which we think don't involve
> breaking the ABI, but discover too late that we have failed to fully
> understand the implications and have broken it by mistake. IIUC this is
> what happened in the 1.4 timeframe when datetime64 was merged and
> accidentally renumbered some of the NPY_* constants.
>

Yes, this is what I'm mainly worried about.    But, more than that, I'm
concerned about general *semantic* and API changes at a rapid pace for a
community that is just looking for stability and bug-fixes from NumPy
itself --- with innovation happening elsewhere.


> Partially I am less worried about this because I have a fair amount of
> confidence that our review and QA process has improved these days to the
> point that we would not let a change like that slip through by accident --
> we have a lot more active reviewers, people are sensitized to the issues,
> we've successfully landed intrusive changes like Sebastian's indexing
> rewrite, ... though this is very much second-hand impressions on my part,
> and I'd welcome input from folks like Chuck who have a clearer view on how
> things have changed from then to now.
>
> But more importantly, even if this is true, then I can't see how your
> proposal helps. If we aren't good enough at our jobs to predict when we'll
> break ABI, then by assumption it makes no sense to pick one release and
> decide that this is the one time that we'll break ABI.)
>

I don't understand your point.   Picking a release to break the ABI allows
you to actually do things like change macros to functions and move
structures around to be more consistent with a new design that is easier to
maintain and allows more growth.   It has nothing to do with "whether you
are good at your job".   Everyone has strengths and weaknesses.
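
To make the "move structures around" point concrete, here is a small sketch
using Python's ctypes. The `ArrayV1` / `ArrayV2` layouts and the `WRITEABLE`
bit are made up for illustration and are not NumPy's real structs or
constants. Code that refers to fields by name keeps working after a reorder
(the API is intact), while anything compiled against the old byte offsets
would read the wrong memory (the ABI changed).

```python
import ctypes

WRITEABLE = 0x1  # made-up flag bit, not NumPy's actual constant


class ArrayV1(ctypes.Structure):
    # Hypothetical "old" layout.
    _fields_ = [
        ("flags", ctypes.c_int32),
        ("data", ctypes.c_void_p),
        ("nd", ctypes.c_int32),
    ]


class ArrayV2(ctypes.Structure):
    # Same fields, reordered so the pointer comes first.
    _fields_ = [
        ("data", ctypes.c_void_p),
        ("nd", ctypes.c_int32),
        ("flags", ctypes.c_int32),
    ]


def is_writeable(arr):
    # Source-level access by field name: identical for both layouts.
    return bool(arr.flags & WRITEABLE)


old = ArrayV1(flags=WRITEABLE, nd=2)
new = ArrayV2(flags=WRITEABLE, nd=2)
print(is_writeable(old), is_writeable(new))

# The byte offset of `flags` differs, which is what a compiled extension sees.
print(ArrayV1.flags.offset, ArrayV2.flags.offset)
print(ctypes.sizeof(ArrayV1), ctypes.sizeof(ArrayV2))
```

The exact offsets and sizes printed depend on the platform's pointer size
and alignment rules.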

This kind of clean-up may be needed regularly --- every 3 years would not
be a crazy pattern, but it could also be every 5 years if you wanted more
discipline.    I already knew we needed to break the ABI "soonish" when I
released NumPy 1.0.    The fact that we haven't officially done it yet (but
have done it unofficially) is a great injustice to "what could be" and has
slowed development of NumPy tremendously.

We've gone back and forth on this.   I'm fine if we disagree, but I just
hope the disagreement doesn't lead to lack of cooperation as we both have
the same ultimate interests in seeing array-computing in Python improve.
I just don't support *major* changes without breaking the ABI without a
whole lot of proof that it is possible (without hackiness).      You have
mentioned on your roadmap a lot of what I would consider *major* changes.
  For some of it you describe how to get there.   For the most important
change (improving the dtype system) you don't.

Part of my point is that we now *know* how to improve the dtype system.
Let's do it.   Let's not try "yet again" to do it differently inside an old
system designed by a scientist who didn't understand type-theory or type
systems (that was me by the way).    Look at data-shape in the blaze
project.    Take that and build a Python type-system that also outputs
struct-string syntax for memory-views.  That's the data-description system
that NumPy should be using --- not trying to hack on a mixed array-scalar,
dtype-object system that may never support everything we now know is
needed.
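
As a very small sketch of what "outputs struct-string syntax" could mean,
using only the standard library: the `Record` class and its field list
below are invented for this example and are not the data-shape API. A type
description emits a `struct` format string, and buffer-protocol objects
already report a format string of the same family through memoryview.

```python
import struct
from array import array


class Record:
    """Hypothetical type description: named fields with struct type codes."""

    def __init__(self, fields, byte_order="<"):
        self.fields = list(fields)
        self.byte_order = byte_order

    @property
    def format(self):
        # Emit the struct-module format string for this record.
        return self.byte_order + "".join(code for _, code in self.fields)

    def pack(self, **values):
        ordered = (values[name] for name, _ in self.fields)
        return struct.pack(self.format, *ordered)

    def unpack(self, raw):
        names = [name for name, _ in self.fields]
        return dict(zip(names, struct.unpack(self.format, raw)))


# A 4-byte int followed by an 8-byte float, little-endian, no padding.
trade = Record([("id", "i"), ("price", "d")])
raw = trade.pack(id=7, price=2.5)
print(trade.format, len(raw), trade.unpack(raw))

# A memoryview over a typed buffer exposes its element format and size.
view = memoryview(array("d", [1.0, 2.0, 3.0]))
print(view.format, view.itemsize)
```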

Trying to increment from where we are now will only lead to a
sub-optimal outcome and unfortunate instability when we already know what
to do differently.    I doubt I will convince you --- certainly not via
email.   I apologize in advance that I likely won't be able to respond in
depth to any more questions that are really just "prove to me that I can't"
kind of questions.  Of course I can't prove that.   All I'm saying is that
to me the evidence and my experience lead me to not be able to support
major changes like you have proposed without also intentionally breaking
the ABI (and thus calling it NumPy 2.0).

If I find time to write, I will try to use it to outline more specifically
what I think is a better approach to array- and table-computing in Python
that keeps the stability of NumPy and adds new features using different
approaches.

-Travis





>
> On Tue, Aug 25, 2015 at 12:00 PM, Travis Oliphant <travis at continuum.io>
> wrote:
>
>> Thanks for the write-up Nathaniel.   There is a lot of great detail and
>> interesting ideas here.
>>
>> I am very eager to understand how to help NumPy and the wider
>> community move forward however I can (my passions on this have not changed
>> since 1999, though what I myself spend time on has changed).
>>
>> There are a lot of ways to think about approaching this, though.   It's
>> hard to get all the ideas on the table, and it was unfortunate we couldn't
>> get everybody who is a core NumPy dev together in person to have this
>> discussion as there are still a lot of questions unanswered and a lot of
>> thought that has gone into other approaches that was not brought up or
>> represented in the meeting (how does Numba fit into this, what about
>> data-shape, dynd, memory-views and Python type system, etc.).   If NumPy
>> becomes just an interface-specification, then why don't we just do that
>> *outside* NumPy itself in a way that doesn't jeopardize the stability of
>> NumPy today.    These are some of the real questions I have.   I will try
>> to write up my thoughts in more depth soon, but  I won't be able to respond
>> in-depth right now.   I just wanted to comment because Nathaniel said I
>> disagree which is only partly true.
>>
>> The three most important things for me are 1) let's make sure we have
>> representation from as wide a part of the community as possible (this is really
>> hard), 2) let's look around at the broader community and the prior art that
>> is happening in this space right now and 3) let's not pretend we are going
>> to be able to make all this happen without breaking ABI compatibility.
>> Let's just break ABI compatibility with NumPy 2.0 *and* have as much
>> fidelity with the API and semantics of current NumPy as possible (though
>> there will be some changes necessary long-term).
>>
>> I don't think we should intentionally break ABI if we can avoid it, but I
>> also don't think we should spend inordinate amounts of time trying to
>> pretend that we won't break ABI (for at least some people), and most
>> importantly we should not pretend *not* to break the ABI when we actually
>> do.    We did this once before with the roll-out of date-time, and it was
>> really unnecessary.     When I released NumPy 1.0, there were several
>> things that I knew should be fixed very soon (NumPy was never designed to
>> not break ABI).    Those problems are still there.    Now that we have
>> quite a bit better understanding of what NumPy *should* be (there have been
>> tremendous strides in understanding and community size over the past 10
>> years), let's actually make the infrastructure we think will last for the
>> next 20 years (instead of trying to shoe-horn new ideas into a 20-year old
>> code-base that wasn't designed for it).
>>
>> NumPy is a hard code-base.  It has been since Numeric days in 1995.     I
>> could be wrong, but my guess is that we will be passed by as a community if
>> we don't seize the opportunity to build something better than we can build
>> if we are forced to use a 20 year old code-base.
>>
>> It is more important to not break people's code and to be clear when a
>> re-compile is necessary for dependencies.   Those to me are the most
>> important constraints. There are a lot of great ideas that we all have
>> about what we want NumPy to be able to do.     Some of these are pretty
>> transformational (and the more exciting they are, the harder I think they
>> are going to be to implement without breaking at least the ABI).     There
>> is probably some CAP-like theorem around
>> Stability-Features-Speed-of-Development (pick 2) when it comes to Open
>> Source Software development and making feature-progress with NumPy *is
>> going* to create instability which concerns me.
>>
>> I would like to see a little-bit-of-pain one time with a NumPy 2.0,
>> rather than the constant pain of the constant-churn-over-many-years
>> approach that Nathaniel seems to advocate.   To me NumPy 2.0 is an
>> ABI-breaking release that is as API-compatible as possible and whose
>> semantics are not dramatically different.
>>
>> There are at least 3 areas of compatibility (ABI, API, and semantic).
>>  ABI-compatibility is a non-feature in today's world.   There are so many
>> distributions of the NumPy stack (and conda makes it trivial for anyone to
>> build their own or for you to build one yourself).   Making less-optimal
>> software-engineering choices because of fear of breaking the ABI is not
>> something I'm supportive of at all.   We should not break ABI every
>> release, but a release every 3 years that breaks ABI is not a problem.
>>
>> API compatibility should be much more sacrosanct, but it is also
>> something that can be managed.   Any NumPy 2.0 should definitely
>> support the full NumPy API (though there could be deprecated swaths).    I
>> think the community has done well in using deprecation and limiting the
>> public API to make this more manageable and I would love to see a NumPy 2.0
>> that solidifies a future-oriented API along with a backward-compatible API
>> that is also available.
>>
>> Semantic compatibility is the hardest.   We have already broken this on
>> multiple occasions throughout the 1.x NumPy releases.  Every time you
>> change the code, this can change.    This is what I fear will cause deep
>> instability over the course of many years.     These are things like the
>> casting rule details,  the effect of indexing changes, any change to the
>> calculation approaches.     It is and has been the most at risk during any
>> code-changes.    My view is that a NumPy 2.0 (with a new low-level
>> architecture) minimizes these changes to a single release rather than
>> unavoidably spreading them out over many, many releases.
>>
>> I think that summarizes my main concerns.  I will write up more forward
>> thinking ideas for what else is possible in the coming weeks.   In the
>> meantime, thanks for keeping the discussion going.  It is extremely exciting to
>> time, thanks for keeping the discussion going.  It is extremely exciting to
>> see the help people have continued to provide to maintain and improve
>> NumPy.    It will be exciting to see what the next few years bring as well.
>>
>>
>> Best,
>>
>> -Travis
>>
>>
>>
>>
>>
>>
>> On Tue, Aug 25, 2015 at 5:03 AM, Nathaniel Smith <njs at pobox.com> wrote:
>>
>>> Hi all,
>>>
>>> These are the notes from the NumPy dev meeting held July 7, 2015, at
>>> the SciPy conference in Austin, presented here so the list can keep up
>>> with what happens, and so you can give feedback. Please do give
>>> feedback, none of this is final!
>>>
>>> (Also, if anyone who was there notices anything I left out or
>>> mischaracterized, please speak up -- these are a lot of notes I'm
>>> trying to gather together, so I could easily have missed something!)
>>>
>>> Thanks to Jill Cowan and the rest of the SciPy organizers for donating
>>> space and organizing logistics for us, and to the Berkeley Institute
>>> for Data Science for funding travel for Jaime, Nathaniel, and
>>> Sebastian.
>>>
>>>
>>> Attendees
>>> =========
>>>
>>>   Present in the room for all or part: Daniel Allan, Chris Barker,
>>>   Sebastian Berg, Thomas Caswell, Jeff Reback, Jaime Fernández del
>>>   Río, Chuck Harris, Nathaniel Smith, Stéfan van der Walt. (Note: I'm
>>>   pretty sure this list is incomplete)
>>>
>>>   Joining remotely for all or part: Stephan Hoyer, Julian Taylor.
>>>
>>>
>>> Formalizing our governance/decision making
>>> ==========================================
>>>
>>>   This was a major focus of discussion. At a high level, the consensus
>>>   was to steal IPython's governance document ("IPEP 29") and modify it
>>>   to remove its use of a BDFL as a "backstop" to normal community
>>>   consensus-based decision, and replace it with a new "backstop" based
>>>   on Apache-project-style consensus voting amongst the core team.
>>>
>>>   I'll send out a proper draft of this shortly for further discussion.
>>>
>>>
>>> Development roadmap
>>> ===================
>>>
>>>   General consensus:
>>>
>>>   Let's assume NumPy is going to remain important indefinitely, and
>>>   try to make it better, instead of waiting for something better to
>>>   come along. (This is unlikely to be wasted effort even if something
>>>   better does come along, and it's hardly a sure thing that that will
>>>   happen anyway.)
>>>
>>>   Let's focus on evolving numpy as far as we can without major
>>>   break-the-world changes (no "numpy 2.0", at least in the foreseeable
>>>   future).
>>>
>>>   And, as a target for that evolution, let's change our view of
>>>   numpy from "NumPy is the library that gives you the np.ndarray
>>>   object (plus some attached infrastructure)" to "NumPy provides the
>>>   standard framework for working with arrays and array-like objects in
>>>   Python".
>>>
>>>   This means creating defined interfaces between array-like objects /
>>>   ufunc objects / dtype objects, so that it becomes possible for third
>>>   parties to add their own and mix-and-match. Right now ufuncs are
>>>   pretty good at this, but if you want a new array class or dtype then
>>>   in most cases you pretty much have to modify numpy itself.
>>>
>>>   Vision: instead of everyone who wants a new container type having to
>>>   reimplement all of numpy, Alice can implement an array class using
>>>   (sparse / distributed / compressed / tiled / gpu / out-of-core /
>>>   delayed / ...) storage, pass it to code that was written using
>>>   direct calls to np.* functions, and it just works. (Instead of
>>>   np.sin being "the way you calculate the sine of an ndarray", it's
>>>   "the way you calculate the sine of any array-like container
>>>   object".)
>>>
>>>   Vision: Darryl can implement a new dtype for (categorical data /
>>>   astronomical dates / integers-with-missing-values / ...) without
>>>   having to touch the numpy core.
>>>
>>>   Vision: Chandni can then come along and combine them by doing
>>>
>>>   a = alice_array([...], dtype=darryl_dtype)
>>>
>>>   and it just works.
>>>
>>>   Vision: no-one is tempted to subclass ndarray, because anything you
>>>   can do with an ndarray subclass you can also easily do by defining
>>>   your own new class that implements the "array protocol".
>>>
>>>
>>> Supporting third-party array types
>>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>>
>>>   Sub-goals:
>>>   - Get __numpy_ufunc__ done, which will cover a good chunk of numpy's
>>>     API right there.
>>>   - Go through the rest of the stuff in numpy, and figure out some
>>>     story for how to let it handle third-party array classes:
>>>     - ufunc ALL the things: Some things can be converted directly into
>>>       (g)ufuncs and then use __numpy_ufunc__ (e.g., np.std); some
>>>       things could be converted into (g)ufuncs if we extended the
>>>       (g)ufunc interface a bit (e.g. np.sort, np.matmul).
>>>     - Some things probably need their own __numpy_ufunc__-like
>>>       extensions (__numpy_concatenate__?)
>>>   - Provide tools to make it easier to implement the more complicated
>>>     parts of an array object (e.g. the bazillion different methods,
>>>     many of which are ufuncs in disguise, or indexing)
>>>   - Longer-run interesting research project: __numpy_ufunc__ requires
>>>     that one or the other object have explicit knowledge of how to
>>>     handle the other, so to handle binary ufuncs with N array types
>>>     you need something like N**2 __numpy_ufunc__ code paths. As an
>>>     alternative, if there were some interface that an object could
>>>     export that provided the operations nditer needs to efficiently
>>>     iterate over (chunks of) it, then you would only need N
>>>     implementations of this interface to handle all N**2 operations.
>>>
>>>   This would solve a lot of problems for projects like:
>>>   - blosc
>>>   - dask
>>>   - distarray
>>>   - numpy.ma
>>>   - pandas
>>>   - scipy.sparse
>>>   - xray
>>>
>>>
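A toy sketch of the "N implementations instead of N**2 code paths" idea
from the sub-goals above. Everything here is hypothetical: `iter_chunks` is
a made-up stand-in for whatever interface nditer would actually need, and
the two container classes are minimal examples, not real libraries. Each
container only knows how to yield chunks of itself, and a single generic
`add` handles every pairing, assuming both operands have the same length.

```python
class DenseArray:
    """Plain list-backed storage."""

    def __init__(self, values):
        self.values = list(values)

    def iter_chunks(self, size):
        for start in range(0, len(self.values), size):
            yield self.values[start:start + size]


class RunLengthArray:
    """Compressed storage: a list of (value, repeat_count) runs."""

    def __init__(self, runs):
        self.runs = list(runs)

    def iter_chunks(self, size):
        chunk = []
        for value, count in self.runs:
            for _ in range(count):
                chunk.append(value)
                if len(chunk) == size:
                    yield chunk
                    chunk = []
        if chunk:
            yield chunk


def add(a, b, chunk_size=2):
    # One generic loop; it never inspects the concrete container types.
    out = []
    for chunk_a, chunk_b in zip(a.iter_chunks(chunk_size),
                                b.iter_chunks(chunk_size)):
        out.extend(x + y for x, y in zip(chunk_a, chunk_b))
    return DenseArray(out)


dense = DenseArray([1, 2, 3, 4, 5])
rle = RunLengthArray([(10, 2), (20, 3)])   # expands to 10, 10, 20, 20, 20

print(add(dense, rle).values)
print(add(rle, rle).values)
```

Adding a third container type would mean writing one more `iter_chunks`,
not a new code path for every existing type.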
>>> Supporting third-party dtypes
>>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>>
>>>   We already have something like a C level "dtype
>>>   protocol". Conceptually, the way you define a new dtype is by
>>>   defining a new class whose instances have data attributes defining
>>>   the parameters of the dtype (what fields are in *this* record dtype,
>>>   how many characters are in *this* string dtype, what units are used
>>>   for *this* datetime64, etc.), and you define a bunch of methods to
>>>   do things like convert an object from a Python object to your dtype
>>>   or vice-versa, to copy an array of your dtype from one place to
>>>   another, to cast to and from your new dtype, etc. This part is
>>>   great.
>>>
>>>   The problem is, in the current implementation, we don't actually use
>>>   the Python object system to define these classes / attributes /
>>>   methods. Instead, all possible dtypes are jammed into a single
>>>   Python-level class, whose struct has fields for the union of all
>>>   possible dtype's attributes, and instead of Python-style method
>>>   slots there's just a big table of function pointers attached to each
>>>   object.
>>>
>>>   So the main proposal is that we keep the basic design, but switch it
>>>   so that the float64 dtype, the int64 dtype, etc. actually literally
>>>   are subclasses of np.dtype, each implementing their own fields and
>>>   Python-style methods.
>>>
>>>   Some of the pieces involved in doing this:
>>>
>>>   - The current dtype methods should be cleaned up -- e.g. 'dot' and
>>>     'less_than' are both dtype methods, when conceptually they're much
>>>     more like ufuncs.
>>>
>>>   - The ufunc inner-loop interface currently does not get a reference
>>>     to the dtype object, so they can't see its attributes and this is
>>>     a big obstacle to many interesting dtypes (e.g., it's hard to
>>>     implement np.equal for categoricals if you don't know what
>>>     categories each has). So we need to add new arguments to the core
>>>     ufunc loop signature. (Fortunately this can be done in a
>>>     backwards-compatible way.)
>>>
>>>   - We need to figure out what exactly the dtype methods should be,
>>>     and add them to the dtype class (possibly with backwards
>>>     compatibility shims for anyone who is accessing PyArray_ArrFuncs
>>>     directly).
>>>
>>>   - Casting will be possibly the trickiest thing to work out, though
>>>     the basic idea of using dunder-dispatch-like __cast__ and
>>>     __rcast__ methods seems workable. (Encouragingly, this is also
>>>     exactly what dynd also does, though unfortunately dynd does not
>>>     yet support user-defined dtypes even to the extent that numpy
>>>     does, so there isn't much else we can steal from them.)
>>>     - We may also want to rethink the casting rules while we're at it,
>>>       since they have some very weird corners right now (e.g. see
>>>       [https://github.com/numpy/numpy/issues/6240])
>>>
>>>   - We need to migrate the current dtypes over to the new system,
>>>     which can be done in stages:
>>>
>>>     - First stick them all in a single "legacy dtype" class whose
>>>       methods just dispatch to the PyArray_ArrFuncs per-object "method
>>>       table"
>>>
>>>     - Then move each of them into their own classes
>>>
>>>   - We should provide a Python-level wrapper for the protocol, so that
>>>     you can call dtype methods from Python
>>>
>>>   - And vice-versa, it should be possible to subclass dtype at the
>>>     Python level
>>>
>>>   - etc.
>>>
>>>   Fortunately, AFAICT pretty much all of this can be done while
>>>   maintaining backwards compatibility (though we may want to break
>>>   some obscure cases to avoid expending *too* much effort with weird
>>>   backcompat contortions that will only help a vanishingly small
>>>   proportion of the userbase), and a lot of the above changes can be
>>>   done as semi-independent mini-projects, so there's no need for some
>>>   branch to go off and spend a year rewriting the world.
>>>
>>>   Obviously there are still a lot of details to work out, though. But
>>>   overall, there was widespread agreement that this is one of the #1
>>>   pain points for our users (e.g. it's the single main request from
>>>   pandas), and fixing it is very high priority.
>>>
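A pure-Python sketch of two pieces from the list above: dtypes as classes
that carry their own fields, and casting via `__cast__` / `__rcast__`
dispatch. This is an illustration under assumptions, not a NumPy API. The
`DType` base class, the method signatures, and the `cast` helper are all
invented here, modeled on how Python handles `__add__` / `__radd__` with
`NotImplemented`.

```python
class DType:
    """Hypothetical base class standing in for np.dtype."""

    def __cast__(self, value, to):
        # "I am the source dtype; can I convert value to dtype `to`?"
        return NotImplemented

    def __rcast__(self, value, from_):
        # "I am the target dtype; can I accept value from dtype `from_`?"
        return NotImplemented


class Int64(DType):
    pass


class Float64(DType):
    def __cast__(self, value, to):
        if isinstance(to, Int64):
            return int(value)   # truncates toward zero
        return NotImplemented


class String(DType):
    pass


class Categorical(DType):
    # A per-dtype field: only categoricals carry a category list.
    def __init__(self, categories):
        self.categories = list(categories)

    def __rcast__(self, value, from_):
        if isinstance(from_, String):
            return self.categories.index(value)   # label -> integer code
        return NotImplemented


def cast(value, from_dtype, to_dtype):
    # Ask the source dtype first, then fall back to the target dtype.
    result = from_dtype.__cast__(value, to_dtype)
    if result is NotImplemented:
        result = to_dtype.__rcast__(value, from_dtype)
    if result is NotImplemented:
        raise TypeError(
            f"no cast from {type(from_dtype).__name__} "
            f"to {type(to_dtype).__name__}"
        )
    return result


colors = Categorical(["red", "green", "blue"])
print(cast(2.9, Float64(), Int64()))     # handled by Float64.__cast__
print(cast("green", String(), colors))   # handled by Categorical.__rcast__
```

Here `String` knows nothing about `Categorical`, yet the cast works because
the target dtype gets a chance to handle it.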
>>>   Some features that would become straightforward to implement
>>>   (e.g. even in third-party libraries) if this were fixed:
>>>   - missing value support
>>>   - physical unit tracking (meters / seconds -> array of velocity;
>>>     meters + seconds -> error)
>>>   - better and more diverse datetime representations (e.g. datetimes
>>>     with attached timezones, or using funky geophysical or
>>>     astronomical calendars)
>>>   - categorical data
>>>   - variable length strings
>>>   - strings-with-encodings (e.g. latin1)
>>>   - forward mode automatic differentiation (write a function that
>>>     computes f(x) where x is an array of float64; pass that function
>>>     an array with a special dtype and get out both f(x) and f'(x))
>>>   - probably others I'm forgetting right now
>>>
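A minimal sketch of the forward-mode automatic differentiation item in the
list above, in plain Python rather than as a real dtype. The `Dual` class
is hypothetical and stands in for one element of such a special dtype, and
it supports only `+` and `*`. The function `f` is written for ordinary
numbers, yet passing it dual-valued elements yields both f(x) and f'(x).

```python
class Dual:
    """A value paired with its derivative with respect to the input."""

    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    @staticmethod
    def _lift(other):
        # Plain numbers are constants, so their derivative is zero.
        return other if isinstance(other, Dual) else Dual(other)

    def __add__(self, other):
        other = Dual._lift(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __mul__(self, other):
        other = Dual._lift(other)
        # Product rule: (uv)' = u v' + u' v
        return Dual(self.value * other.value,
                    self.value * other.deriv + self.deriv * other.value)

    __rmul__ = __mul__


def f(x):
    # Written with ordinary arithmetic; unaware of Dual.
    return 3 * x * x + 2 * x + 1


# Seed each element with derivative 1.0 (dx/dx = 1).
xs = [Dual(v, 1.0) for v in (0.0, 1.0, 2.0)]
ys = [f(x) for x in xs]

values = [y.value for y in ys]   # f(x)  = 3x^2 + 2x + 1
derivs = [y.deriv for y in ys]   # f'(x) = 6x + 2
print(values)
print(derivs)
```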
>>>   I should also note that there was one substantial objection to this
>>>   plan, from Travis Oliphant (in discussions later in the
>>>   conference). I'm not confident I understand his objections well
>>>   enough to reproduce them here, though -- perhaps he'll elaborate.
>>>
>>>
>>> Money
>>> =====
>>>
>>>   There was an extensive discussion on the topic of: "if we had money,
>>>   what would we do with it?"
>>>
>>>   This is partially motivated by the realization that there are a
>>>   number of sources that we could probably get money from, if we had a
>>>   good story for what we wanted to do, so it's not just an idle
>>>   question.
>>>
>>>   Points of general agreement:
>>>
>>>   - Doing the in-person meeting was a good thing. We should plan to do
>>>     that again, at least once a year. So one thing to spend money on
>>>     is travel subsidies to make sure that happens and is productive.
>>>
>>>   - While it's tempting to imagine hiring junior people for the more
>>>     frustrating/boring work like maintaining buildbots, release
>>>     infrastructure, updating docs, etc., this seems difficult to do
>>>     realistically with our current resources -- how do we hire for
>>>     this, who would manage them, etc.?
>>>
>>>   - On the other hand, the general feeling was that if we found the
>>>     money to hire a few more senior people who could take care of
>>>     themselves more, then that would be good and we could
>>>     realistically absorb that extra work without totally unbalancing
>>>     the project.
>>>
>>>     - A major open question is how we would recruit someone for a
>>>       position like this, since apparently all the obvious candidates
>>>       who are already active on the NumPy team already have other
>>>       things going on. [For calibration on how hard this can be: NYU
>>>       has apparently had an open position for a year with the job
>>>       description of "come work at NYU full-time with a
>>>       private-industry-competitive-salary on whatever your personal
>>>       open-source scientific project is" (!) and still is having an
>>>       extremely difficult time filling it:
>>>       [http://cds.nyu.edu/research-engineer/]]
>>>
>>>     - General consensus, though, was that there isn't much to be done
>>>       about this except try it and see.
>>>
>>>     - (By the way, if you're someone who's reading this and
>>>       potentially interested in like a postdoc or better working on
>>>       numpy, then let's talk...)
>>>
>>>
>>> More specific changes to numpy that had general consensus, but don't really fit into a high-level roadmap
>>> =========================================================================================================
>>>
>>>   - Resolved: we should merge multiarray.so and umath.so into a single
>>>     extension module, so that they can share utility code without the
>>>     current awkward contortions.
>>>
>>>   - Resolved: we should start hiding new fields in the ufunc and dtype
>>>     structs as soon as possible going forward. (I.e. they would not be
>>>     present in the version of the structs that are exposed through the
>>>     C API, but internally we would use a more detailed struct.)
>>>     - Mayyyyyybe we should even go ahead and hide the subset of the
>>>       existing fields that are really internal details that no-one
>>>       should be using. If we did this without changing anything else
>>>       then it would preserve ABI (the fields would still be where
>>>       existing compiled extensions expect them to be, if any such
>>>       extensions exist) while breaking API (trying to compile such
>>>       extensions would give a clear error), so would be a smoother
>>>       ramp if we think we need to eventually break those fields for
>>>       real. (As discussed above, there are a bunch of fields in the
>>>       dtype base class that only make sense for specific dtype
>>>       subclasses, e.g. only record dtypes need a list of field names,
>>>       but right now all dtypes have one anyway. So it would be nice to
>>>       remove these from the base class entirely, but that is
>>>       potentially ABI-breaking.)
>>>
>>>   - Resolved: np.array should never return an object array unless
>>>     explicitly requested (e.g. with dtype=object); it just causes too
>>>     many surprising problems.
>>>     - First step: add a deprecation warning
>>>     - Eventually: make it an error.
>>>
>>>   - The matrix class
>>>     - Resolved: We won't add warnings yet, but we will prominently
>>>       document that it is deprecated and should be avoided wherever
>>>       possible.
>>>       - Stéfan van der Walt volunteers to do this.
>>>     - We'd all like to deprecate it properly, but the feeling was that
>>>       the precondition for this is for scipy.sparse to provide sparse
>>>       "arrays" that don't return np.matrix objects on ordinary
>>>       operations. Until that happens we can't reasonably tell people
>>>       that using np.matrix is a bug.
>>>
>>>   - Resolved: we should add a similar prominent note to the
>>>     "subclassing ndarray" documentation, warning people that this is
>>>     painful and barely works and please don't do it if you have any
>>>     alternatives.
>>>
>>>   - Resolved: we want more, smaller releases -- every 6 months at
>>>     least, aiming to go even faster (every 4 months?)
>>>
>>>   - On the question of using Cython inside numpy core:
>>>     - Everyone agrees that there are places where this would be an
>>>       improvement (e.g., Python<->C interfaces, and places "when you
>>>       want to do computer science", e.g. complicated algorithmic stuff
>>>       like graph traversals)
>>>     - Chuck wanted it to be clear though that he doesn't think it
>>>       would be a good goal to try and rewrite all of numpy in Cython
>>>       -- there also exist places where Cython ends up being "an uglier
>>>       version of C". No-one disagreed.
>>>
>>>   - Our text reader is apparently not very functional on Python 3, and
>>>     generally slow and hard to work with.
>>>     - Resolved: We should extract Pandas's awesome text reader/parser
>>>       and convert it into its own package, that could then become a
>>>       new backend for both pandas and numpy.loadtxt.
>>>     - Jeff thinks this is a great idea
>>>     - Thomas Caswell volunteers to do the extraction.
>>>
>>>   - We should work on improving our tools for evolving the ABI, so
>>>     that we will eventually be less constrained by decisions made
>>>     decades ago.
>>>     - One idea that had a lot of support was to switch from our
>>>       current append-only C-API to a "sliding window" API based on
>>>       explicit versions. So a downstream package might say
>>>
>>>       #define NUMPY_API_VERSION 4
>>>
>>>       and they'd get the functions and behaviour provided in "version
>>>       4" of the numpy C api. If they wanted to get access to new stuff
>>>       that was added in version 5, then they'd need to switch that
>>>       #define, and at the same time clean up any usage of stuff that
>>>       was removed or changed in version 5. And to provide a smooth
>>>       migration path, one version of numpy would support multiple
>>>       versions at once, gradually deprecating and dropping old
>>>       versions.
>>>
>>>     - If anyone wants to help bring pip up to scratch WRT tracking ABI
>>>       dependencies (e.g., 'pip install numpy==<version with new ABI>'
>>>       -> triggers rebuild of scipy against the new ABI), then that
>>>       would be an extremely useful thing.
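
The "sliding window" idea in the notes above can be modelled in a few lines. This is only a rough sketch in Python rather than C, assuming a release that supports API versions 4 and 5 at once; `SUPPORTED_VERSIONS`, `API_TABLES`, `get_api` and the function names in the tables are all hypothetical and do not exist in NumPy.

```python
# Hypothetical model of a versioned "sliding window" API.
# None of these names exist in NumPy; they only illustrate the idea.

# This imaginary release supports API versions 4 and 5 simultaneously.
SUPPORTED_VERSIONS = range(4, 6)

# Functions visible under each API version (illustrative names only).
API_TABLES = {
    4: {"old_call": lambda x: x + 1, "stable_call": lambda x: x * 2},
    5: {"stable_call": lambda x: x * 2, "new_call": lambda x: x ** 2},
}


def get_api(requested_version):
    """Return the function table for the version a downstream package declares."""
    if requested_version not in SUPPORTED_VERSIONS:
        raise ImportError(
            f"API version {requested_version} is outside the supported window "
            f"{SUPPORTED_VERSIONS.start}..{SUPPORTED_VERSIONS.stop - 1}"
        )
    return API_TABLES[requested_version]


# A downstream package pins the version it was written against,
# analogous to '#define NUMPY_API_VERSION 4'.
api_v4 = get_api(4)
# Moving to version 5 gains new_call but loses old_call.
api_v5 = get_api(5)
```

Dropping an old version then amounts to moving the lower bound of the window, at which point packages still requesting it get a clear error rather than silently different behaviour.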
>>>
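
For the np.array resolution above, the intended end state (an error unless an object array is explicitly requested) might look roughly like the wrapper below. `strict_array` is a made-up helper for illustration, not a NumPy function, and the exact behaviour of plain `np.array` on ragged input depends on the NumPy version installed.

```python
import numpy as np


def strict_array(obj, dtype=None):
    """Hypothetical wrapper: build an object array only when dtype is given."""
    if dtype is not None:
        return np.array(obj, dtype=dtype)
    try:
        result = np.array(obj)
    except ValueError as exc:
        # Some NumPy releases already refuse to guess for ragged input.
        raise TypeError(
            "could not build a regular array; pass dtype=object explicitly "
            "if an object array is intended"
        ) from exc
    if result.dtype == object:
        # Other releases fall back to an object array; reject that here.
        raise TypeError("implicit object array; pass dtype=object explicitly")
    return result


ragged = [[1, 2], [3]]
# Allowed: the caller asked for an object array explicitly.
explicit = strict_array(ragged, dtype=object)
```

Calling `strict_array(ragged)` without a dtype raises `TypeError`, which is the kind of early, visible failure the resolution is after.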
>>>
>>> Policies that should be documented
>>> ==================================
>>>
>>>   ...together with some notes about what the contents of the document
>>>   should be:
>>>
>>>
>>> How we manage bugs in the bug tracker.
>>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>>
>>>   - Github "milestones" should *only* be assigned to release-blocker
>>>     bugs (which mostly means "regression from the last release").
>>>
>>>     In particular, if you're tempted to push a bug forward to the next
>>>     release... then it's clearly not a blocker, so don't set it to the
>>>     next release's milestone, just remove the milestone entirely.
>>>
>>>     (Obvious exception to this: deprecation followup bugs where we
>>>     decide that we want to keep the deprecation around a bit longer
>>>     are a case where a bug actually does switch from being a blocker
>>>     for release 1.x to being a blocker for release 1.(x+1).)
>>>
>>>   - Don't hesitate to close an issue if there's no way forward --
>>>     e.g. a PR where the author has disappeared. Just post a link to
>>>     this policy and close, with a polite note that we need to keep our
>>>     tracker useful as a todo list, but they're welcome to re-open if
>>>     things change.
>>>
>>>
>>> Deprecations and breakage policy:
>>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>>
>>>   - How long do we need to keep DeprecationWarnings around before we
>>>     break things? This is tricky because on the one hand an aggressive
>>>     (short) deprecation period lets us deliver new features and
>>>     important cleanups more quickly, but on the other hand a
>>>     too-aggressive deprecation period is difficult for our more
>>>     conservative downstream users.
>>>
>>>     - Idea that had the most support: pick a somewhat-aggressive
>>>       warning period as our default, and make a rule that if someone
>>>       asks for an extension during the beta cycle for the release that
>>>       removes it, then we put it back for another release or two worth
>>>       of grace period. (While also possibly upgrading the warning to
>>>       be more visible during the grace period.) This gives us
>>>       deprecation periods that are more adaptive on a case-by-case
>>>       basis.
>>>
>>>   - Lament: it would be really nice if we could get more people to
>>>     test our beta releases, because in practice right now 1.x.0 ends
>>>     up being where we actually discover all the bugs, and 1.x.1 is
>>>     where it actually becomes usable. Which sucks, and makes it
>>>     difficult to have a solid policy about what counts as a
>>>     regression, etc. Is there anything we can do about this?
>>>
>>>   - ABI breakage: we distinguish between an ABI break that breaks
>>>     everything (e.g., "import scipy" segfaults), versus an ABI break
>>>     that breaks an occasional rare case (e.g., only apps that poke
>>>     around in some obscure corner of some struct are affected).
>>>
>>>     - The "break-the-world" type remains off-limits for now: the pain
>>>       is still too large (conda helps, but there are lots of people
>>>       who don't use conda!), and there aren't really any compelling
>>>       improvements that this would enable anyway.
>>>
>>>     - For the "break-0.1%-of-users" type, it is *not* ruled out by
>>>       fiat, though we remain conservative: we should treat it like
>>>       other API breaks in principle, and do a careful case-by-case
>>>       analysis of the details of the situation, taking into account
>>>       what kind of code would be broken, how common these cases are,
>>>       how important the benefits are, whether there are any specific
>>>       mitigation strategies we can use, etc. -- with this process of
>>>       course taking into account that a segfault is nastier than a
>>>       Python exception.
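
As a small illustration of the warn-then-remove cycle with an extendable grace period discussed above, here is a sketch using only the standard `warnings` module. The function names, the default period of two releases and the two-release extension are assumptions picked for the example; the notes deliberately leave the actual numbers open.

```python
import warnings


def new_function(x):
    """Replacement for the deprecated function (hypothetical)."""
    return x * 2


def old_function(x):
    """Deprecated entry point that still works but warns (hypothetical)."""
    warnings.warn(
        "old_function is deprecated; use new_function instead",
        DeprecationWarning,
        stacklevel=2,
    )
    return new_function(x)


def removal_release(deprecated_in, default_period=2, extensions=0, grace=2):
    """Minor release number in which the deprecated behaviour is removed.

    default_period and grace are made-up values for illustration; each
    extension requested during a beta cycle pushes removal back by `grace`.
    """
    return deprecated_in + default_period + extensions * grace


# Record the warning instead of printing it, as a test suite might.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    value = old_function(21)
```

With these assumed numbers, something deprecated in 1.10 would be removed in 1.12 by default, or in 1.14 if one extension is requested.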
>>>
>>>
>>> Other points that were discussed
>>> ================================
>>>
>>>   - There was inconclusive discussion of what we should do with dot()
>>>     in the places where it disagrees with the PEP 465 matmul semantics
>>>     (specifically this is when both arguments have ndim >= 3, or one
>>>     argument has ndim == 0).
>>>     - The concern is that the current behavior is not very useful, and
>>>       as far as we can tell no-one is using it; but, as people get
>>>       used to the more-useful PEP 465 behavior, they will increasingly
>>>       try to use it on the assumption that np.dot will work the same
>>>       way, and this will create pain for lots of people. So Nathaniel
>>>       argued that we should start at least issuing a visible warning
>>>       when people invoke the corner-case behavior.
>>>     - But OTOH, np.dot is such a core piece of infrastructure, and
>>>       there's such a large landscape of code out there using numpy
>>>       that we can't see, that others were reasonably wary of making
>>>       any change.
>>>     - For now: document prominently, but no change in behavior.
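
The two corner cases mentioned above are easy to see from the result shapes. A short sketch, assuming a NumPy version that provides `np.matmul`:

```python
import numpy as np

a = np.ones((2, 3, 4))
b = np.ones((2, 4, 5))

# matmul treats the leading axis as a stack of matrices and broadcasts over it.
matmul_shape = np.matmul(a, b).shape  # (2, 3, 5)

# dot contracts the last axis of a with the second-to-last axis of b and
# keeps all remaining axes of both operands.
dot_shape = np.dot(a, b).shape  # (2, 3, 2, 5)

# With a zero-dimensional operand, dot falls back to scalar multiplication,
# while matmul rejects the input.
scaled = np.dot(np.eye(2), 3.0)
try:
    np.matmul(np.eye(2), 3.0)
    matmul_accepts_scalar = True
except (ValueError, TypeError):
    matmul_accepts_scalar = False
```

Code that relies on the `(2, 3, 2, 5)` outer-product-style result is what any future change to `np.dot` would affect.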
>>>
>>>
>>> Links to raw notes
>>> ==================
>>>
>>>   Main page:
>>>   [https://github.com/numpy/numpy/wiki/SciPy-2015-developer-meeting]
>>>
>>>   Notes from the meeting proper:
>>>   [https://docs.google.com/document/d/1IJcYdsHtk8MVAM4AZqFDBSf_nVG-mrB4Tv2bh9u1g4Y/edit?usp=sharing]
>>>
>>>   Slides from the followup BoF:
>>>   [https://gist.github.com/njsmith/eb42762054c88e810786/raw/b74f978ce10a972831c582485c80fb5b8e68183b/future-of-numpy-bof.odp]
>>>
>>>   Notes from the followup BoF:
>>>   [https://docs.google.com/document/d/11AuTPms5dIPo04JaBOWEoebXfk-tUzEZ-CvFnLIt33w/edit]
>>>
>>> -n
>>>
>>> --
>>> Nathaniel J. Smith -- http://vorpus.org
>>> _______________________________________________
>>> NumPy-Discussion mailing list
>>> NumPy-Discussion at scipy.org
>>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>>>
>>
>>
>>
>> --
>>
>> *Travis Oliphant*
>> *Co-founder and CEO*
>>
>>
>> @teoliphant
>> 512-222-5440
>> http://www.continuum.io
>>
>>
>
>
> --
> Nathaniel J. Smith -- http://vorpus.org
>
>


-- 

*Travis Oliphant*
*Co-founder and CEO*


@teoliphant
512-222-5440
http://www.continuum.io