[Numpy-discussion] Notes from the numpy dev meeting at scipy 2015

Marten van Kerkwijk m.h.vankerkwijk at gmail.com
Mon Aug 31 00:12:46 EDT 2015


Hi Nathaniel, others,

I read the discussion of plans with interest. One item that struck me is
that while there are great plans to have a proper extensible and presumably
subclassable dtype, it is discouraged to subclass ndarray itself (rather,
it is encouraged to use a broader array interface). From my experience with
astropy in both Quantity (an ndarray subclass), Time (a separate class
containing high precision times using two ndarray float64), and Table
(initially holding structured arrays, but now sets of Columns, which
themselves are ndarray subclasses), I'm not convinced the broader, new
containers approach is that much preferable. Rather, it leads to a lot of
boiler-plate code to reimplement things ndarray does already (since one is
effectively just calling the methods on the underlying arrays).

I also think the idea that a dtype becomes something that also contains a
unit is a bit odd. Shouldn't dtype just be about how data is stored? Why
include meta-data such as units?

Instead, I think a quantity is most logically seen as numbers with a unit,
just like masked arrays are numbers with masks, and variables are numbers
with uncertainties. Each of these cases adds extra information in different
forms, and all are quite easily thought of as subclasses of ndarray where
all operations do the normal operation, plus some extra work to keep the
extra information up to date.
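
To make this concrete, here is a minimal sketch of the pattern (not
astropy's actual Quantity; just a toy subclass carrying a unit through
views, slices, and simple ufunc calls):

    import numpy as np

    class Unitful(np.ndarray):
        """Toy Quantity-like subclass: an ndarray plus a `unit` attribute."""

        def __new__(cls, data, unit=None):
            obj = np.asarray(data).view(cls)
            obj.unit = unit
            return obj

        def __array_finalize__(self, obj):
            # Called for explicit construction, views, and ufunc output
            # templates alike; propagate the extra information each time.
            if obj is None:
                return
            self.unit = getattr(obj, 'unit', None)

    v = Unitful([1.0, 2.0, 3.0], unit='m')
    print((2 * v).unit, v[1:].unit)   # m m -- the unit rides along

A real Quantity of course also needs __array_wrap__-style logic so that,
e.g., multiplying meters by meters yields meters squared -- that is exactly
the "extra work to keep the extra information up to date" mentioned above.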

Anyway, my suggestion would be to *encourage* rather than discourage
ndarray subclassing, and help this by making ndarray (even) better.

All the best,

Marten




On Thu, Aug 27, 2015 at 11:03 AM, <josef.pktd at gmail.com> wrote:

>
>
> On Wed, Aug 26, 2015 at 10:06 AM, Travis Oliphant <travis at continuum.io>
> wrote:
>
>>
>>
>> On Wed, Aug 26, 2015 at 1:41 AM, Nathaniel Smith <njs at pobox.com> wrote:
>>
>>> Hi Travis,
>>>
>>> Thanks for taking the time to write up your thoughts!
>>>
>>> I have many thoughts in return, but I will try to restrict myself to two
>>> main ones :-).
>>>
>>> 1) On the question of whether work should be directed towards improving
>>> NumPy-as-it-is or instead towards a compatibility-breaking replacement:
>>> There's plenty of room for debate about whether it's better engineering
>>> practice to try and evolve an existing system in place versus starting
>>> over, and I guess we have some fundamental disagreements there, but I
>>> actually think this debate is a distraction -- we can agree to disagree,
>>> because in fact we have to try both.
>>>
>>
>> Yes, on this we agree.   I think NumPy can improve *and* we can have new
>> innovative array objects.   I don't disagree about that.
>>
>>
>>>
>>> At a practical level: NumPy *is* going to continue to evolve, because it
>>> has users and people interested in evolving it; similarly, dynd and other
>>> alternative libraries will also continue to evolve, because they also have
>>> people interested in doing it. And at a normative level, this is a good
>>> thing! If NumPy and dynd both get better, then that's awesome: the worst
>>> case is that NumPy adds the new features that we talked about at the
>>> meeting, and dynd simultaneously becomes so awesome that everyone wants to
>>> switch to it, and the result of this would be... that those NumPy features
>>> are exactly the ones that will make the transition to dynd easier. Or if
>>> some part of that plan goes wrong, then well, NumPy will still be there as
>>> a fallback, and in the mean time we've actually fixed the major pain points
>>> our users are begging us to fix.
>>>
>>> You seem to be urging us all to make a double-or-nothing wager that your
>>> extremely ambitious plans will all work out, with the entire numerical
>>> Python ecosystem as the stakes. I think this ambition is awesome, but maybe
>>> it'd be wise to hedge our bets a bit?
>>>
>>
>> You are mis-characterizing my view.  I think NumPy can evolve (though I
>> would personally rather see a bigger change to the underlying system like I
>> outlined before).    But, I don't believe it can even evolve easily in the
>> direction needed without breaking ABI, and insisting on not breaking it ---
>> or even putting too much effort into not breaking it --- will continue to
>> create less-optimal solutions that are harder to maintain and do not take
>> advantage of knowledge this community now has.
>>
>> I'm also very concerned that 'evolving' NumPy will create a situation
>> where there are regular semantic and subtle API changes that will cause
>> NumPy to be less stable for its user-base.    I've watched this happen ---
>> and at a time when people are already looking around for new and different
>> approaches anyway.
>>
>>
>>>
>>> 2) You really emphasize this idea of an ABI-breaking (but not
>>> API-breaking) release, and I think this must indicate some basic gap in how
>>> we're looking at things. Where I'm getting stuck here is that... I actually
>>> can't think of anything important that we can't do now, but could if we
>>> were allowed to break ABI compatibility. The kinds of things that break ABI
>>> but keep API are like... rearranging what order the fields in a struct fall
>>> in, or changing the numeric value of opaque constants like
>>> NPY_ARRAY_WRITEABLE. The biggest win I can think of is that we could save a
>>> few bytes per array by arranging the fields inside the ndarray struct more
>>> optimally, but that's hardly a feature to hang a 2.0 on. You seem to have a
>>> vision of this ABI-breaking release as being something very different from
>>> that, and I'm not clear on what this vision is.
>>>
>>>
>> We already broke the ABI with date-time changes --- it's still broken for
>> a certain percentage of users last I checked.    So, part of my
>> disagreement is that we've tried this and it didn't work --- even though
>> smart people thought it would.    I've had to deal with this personally and
>> I'm not enthusiastic about having to deal with this for the next 5 years
>> because of even more attempts to make changes while not breaking the ABI.
>>  I think the group is more careful now --- but I still think the API is
>> broad enough and uses of NumPy deep enough that the effort involved in
>> trying not to break the ABI is just not worth it (because ABI stability is a
>> non-feature today).    Adding new dtypes without breaking the ABI is tricky
>> (and what it takes to avoid breaking it is ugly).       I also continue to
>> believe that putting out a new ABI-breaking NumPy will allow re-compiling
>> *once* (with some porting changes needed) and not subtle breakages
>> requiring code-changes every time a release is made.    If subtle changes
>> aren't made, then the new features won't come.   Right now, I'd rather have
>> stability from NumPy than new features.   New features can come from other
>> libraries.
>>
>> One specific change that could easily be made in NumPy 2.0 (the current
>> code but with an ABI change) is that Dtypes should become true type objects
>> and array-scalars (which are the current type-objects) should become
>> instances of those dtypes. That is, I think, the biggest clean-up needed on
>> the array front.   There should not be *both* array-scalars and dtype
>> objects.    They are the same thing fundamentally.    It was a mistake to
>> have both of them.  I don't see how to make that change without breaking
>> the ABI.     Perhaps it could be done in a creative way --- but why put the
>> effort into that and end up with an even more hacky code-base.
>>
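>> To make the current duplication concrete, a small sketch of today's
>> behavior (runnable against NumPy 1.x; the trailing comment describes the
>> proposed design, which does not exist yet):
>>
>>   import numpy as np
>>
>>   scalar = np.float64(1.0)
>>   print(type(scalar))            # numpy.float64 -- the array-scalar type
>>   print(np.dtype(np.float64))    # float64 -- a *separate* dtype instance
>>   print(isinstance(np.dtype(np.float64), type))  # False: dtypes aren't types
>>
>>   # Proposed: np.dtype('float64') would itself *be* a type object, and
>>   # np.float64(1.0) an instance of it -- one hierarchy instead of two.
>>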
>> NumPy's ABI was influenced by and evolved from Numeric and Numarray.  It
>> was not "designed" to last 30 years.
>>
>> I think the dtype "types" should potentially have different
>> member-structures.       The ufunc sub-system needs an overhaul --- its
>> member structures need upgrades.   With generalized ufuncs and the
>> iteration protocols of Mark Wiebe we know a whole lot more about ufuncs
>> now.   Ufuncs are the same 1995 structure that Jim Hugunin wrote.   I
>> suppose you *could* just tack new functions on the end of the structure and
>> keep growing the list (while leaving old, unused structures as unused or
>> deprecated) --- or you can take the opportunity to tidy up a bit.   The
>> longer you leave everything the same, the harder you make the code-base and
>> the more costly maintenance becomes.    I just don't see the value there
>> --- and I see a lot of pain.
>>
>> Regarding the ufunc subsystem.  We've argued before about the lack of
>> multimethods in NumPy.    Continuing to add dunder-methods to try and get
>> around it will continue to make the system harder to maintain and more
>> brittle.
>>
>> You mention making NumPy an interface to multiple things along with many
>> other ideas.   I don't believe you can get there without real changes that
>> break things (at the very least semantic changes).   I'm not excited about
>> those changes causing instability (which they will cause --- to me the
>> burden of proof that they won't is on you who wants to make the change and
>> not on me to say how they will).       I also think it will take much
>> longer to get there incrementally (if at all) than just creating something
>> on top of newer ideas.
>>
>>
>>
>>> The main reason I personally am against having a big ABI-breaking
>>> release is not that I hate ABI breakage a priori, it's that all the big
>>> features that I care about and that users are asking for seem to be ones
>>> that... don't actually require doing that. At most they seem to get a mild
>>> benefit from breaking some obscure corner cases. So the cost/benefits don't
>>> make any sense to me.
>>>
>>> So: can you give a concrete example of a change you have in mind where
>>> breaking ABI would be the key enabler?
>>>
>>> (I guess you might also be thinking of a separate issue that you sort of
>>> allude to: Perhaps we will try to make changes which we think don't involve
>>> breaking the ABI, but discover too late that we have failed to fully
>>> understand the implications and have broken it by mistake. IIUC this is
>>> what happened in the 1.4 timeframe when datetime64 was merged and
>>> accidentally renumbered some of the NPY_* constants.
>>>
>>
>> Yes, this is what I'm mainly worried about.    But, more than that, I'm
>> concerned about general *semantic* and API changes at a rapid pace for a
>> community that is just looking for stability and bug-fixes from NumPy
>> itself --- with innovation happening elsewhere.
>>
>>
>>> Partially I am less worried about this because I have a fair amount of
>>> confidence that our review and QA process has improved these days to the
>>> point that we would not let a change like that slip through by accident --
>>> we have a lot more active reviewers, people are sensitized to the issues,
>>> we've successfully landed intrusive changes like Sebastian's indexing
>>> rewrite, ... though this is very much second-hand impressions on my part,
>>> and I'd welcome input from folks like Chuck who have a clearer view on how
>>> things have changed from then to now.
>>>
>>> But more importantly, even if this is true, then I can't see how your
>>> proposal helps. If we aren't good enough at our jobs to predict when we'll
>>> break ABI, then by assumption it makes no sense to pick one release and
>>> decide that this is the one time that we'll break ABI.)
>>>
>>
>> I don't understand your point.   Picking a release to break the ABI
>> allows you to actually do things like change macros to functions and move
>> structures around to be more consistent with a new design that is easier to
>> maintain and allows more growth.   It has nothing to do with "whether you
>> are good at your job".   Everyone has strengths and weaknesses.
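>>
>> To illustrate the macro-to-function point: the reason macros constrain
>> the ABI is that they compile struct offsets into every downstream
>> extension. A schematic sketch (simplified, not NumPy's actual headers):
>>
>>   /* Macro form: the offset of `nd` is baked into each extension's
>>      binary, so rearranging the struct breaks them all. */
>>   #define ARRAY_NDIM(arr) (((ArrayObject *)(arr))->nd)
>>
>>   /* Function form: extensions call through the exported C-API table,
>>      so the struct layout stays private and free to change. */
>>   int ARRAY_NDIM(const ArrayObject *arr);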
>>
>> This kind of clean-up may be needed regularly --- every 3 years would not
>> be a crazy pattern, but it could also be every 5 years if you wanted more
>> discipline.    I already knew we needed to break the ABI "soonish" when I
>> released NumPy 1.0.    The fact that we haven't officially done it yet (but
>> have done it unofficially) is a great injustice to "what could be" and has
>> slowed development of NumPy tremendously.
>>
>> We've gone back and forth on this.   I'm fine if we disagree, but I just
>> hope the disagreement doesn't lead to lack of cooperation as we both have
>> the same ultimate interests in seeing array-computing in Python improve.
>> I just don't support making *major* changes without breaking the ABI unless
>> there is a whole lot of proof that it is possible (without hackiness).      You have
>> mentioned on your roadmap a lot of what I would consider *major* changes.
>>   Some of it you describe how to get there.   The most important change
>> (improving the dtype system) you don't.
>>
>> Part of my point is that we now *know* how to improve the dtype system.
>> Let's do it.   Let's not try "yet again" to do it differently inside an old
>> system designed by a scientist who didn't understand type-theory or type
>> systems (that was me by the way).    Look at data-shape in the blaze
>> project.    Take that and build a Python type-system that also outputs
>> struct-string syntax for memory-views.  That's the data-description system
>> that NumPy should be using --- not trying to hack on a mixed array-scalar,
>> dtype-object system that may never support everything we now know is
>> needed.
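>>
>> For reference, the "struct-string syntax for memory-views" is the PEP
>> 3118 buffer format string, which NumPy can already emit today -- a quick
>> sketch of the kind of data-description being discussed:
>>
>>   import numpy as np
>>
>>   dt = np.dtype([('x', np.float64), ('y', np.int32)])
>>   arr = np.zeros(3, dtype=dt)
>>   print(memoryview(arr).format)   # 'T{d:x:i:y:}' -- the struct-string form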
>>
>> Trying to increment from where we are now will only lead to a
>> sub-optimal outcome and unfortunate instability when we already know what
>> to do differently.    I doubt I will convince you --- certainly not via
>> email.   I apologize in advance that I likely won't be able to respond in
>> depth to any more questions that are really just "prove to me that I can't"
>> kind of questions.  Of course I can't prove that.   All I'm saying is that
>> to me the evidence and my experience leads me to not be able to support
>> major changes like you have proposed without also intentionally breaking
>> the ABI (and thus calling it NumPy 2.0).
>>
>> If I find time to write, I will try to use it to outline more
>> specifically what I think is a better approach to array- and
>> table-computing in Python that keeps the stability of NumPy and adds new
>> features using different approaches.
>>
>> -Travis
>>
>>
>
> From my perspective the incremental evolutionary approach in numpy (and
> scipy) in the last few years has worked quite well, and I'm optimistic that
> it will keep working in the future if the developers can pull it off.
>
> The main changes that I remember that needed adjustment in scipy (as
> observer) or statsmodels (as maintainer) came from numpy becoming stricter in
> several cases. This mainly affects corner cases or cases where the
> downstream code wasn't "clean". Some API breaking (with deprecation) and
> some semantic changes are still needed independent of any big changes that
> may or may not be arriving anytime soon.
>
> This way we get improvements in a core library with the requirement that
> every once in a while we need to adjust our code. (And with the occasional
> unintended side effect where test coverage is not enough.)
> The advantage is that we are getting the improvements with the regular
> release cycles, and they keep numpy alive and competitive for another 10
> years or more. In the meantime, other packages like pandas can cater to
> and expand into other use cases, while still others develop generic arrays
> and out-of-core and distributed arrays.
>
> I'm partially following some of the Julia mailing lists. Starting
> something from scratch is a lot of work, and my guess is that similar
> approaches in python will take some time to become mainstream. In the
> meantime we can build something on an improving numpy.
>
> ---
> The only thing I'm not so happy about in the last years is the
> proliferation of object arrays, both in numpy code and in pandas. And I
> hope that the (dtype) proposals help to get rid of some of those object
> arrays.
>
>
> Josef
>
>
>>
>>
>>
>>
>>>
>>> On Tue, Aug 25, 2015 at 12:00 PM, Travis Oliphant <travis at continuum.io>
>>> wrote:
>>>
>>>> Thanks for the write-up, Nathaniel.   There is a lot of great detail and
>>>> interesting ideas here.
>>>>
>>>> I am very eager to understand how to help NumPy and the wider
>>>> community move forward however I can (my passions on this have not changed
>>>> since 1999, though what I myself spend time on has changed).
>>>>
>>>> There are a lot of ways to think about approaching this, though.   It's
>>>> hard to get all the ideas on the table, and it was unfortunate we couldn't
>>>> get everybody who is a core NumPy dev together in person to have this
>>>> discussion as there are still a lot of questions unanswered and a lot of
>>>> thought that has gone into other approaches that was not brought up or
>>>> represented in the meeting (how does Numba fit into this, what about
>>>> data-shape, dynd, memory-views and Python type system, etc.).   If NumPy
>>>> becomes just an interface-specification, then why don't we just do that
>>>> *outside* NumPy itself, in a way that doesn't jeopardize the stability of
>>>> NumPy today?    These are some of the real questions I have.   I will try
>>>> to write up my thoughts in more depth soon, but  I won't be able to respond
>>>> in-depth right now.   I just wanted to comment because Nathaniel said I
>>>> disagree which is only partly true.
>>>>
>>>> The three most important things for me are 1) let's make sure we have
>>>> representation from as wide of the community as possible (this is really
>>>> hard), 2) let's look around at the broader community and the prior art that
>>>> is happening in this space right now and 3) let's not pretend we are going
>>>> to be able to make all this happen without breaking ABI compatibility.
>>>> Let's just break ABI compatibility with NumPy 2.0 *and* have as much
>>>> fidelity with the API and semantics of current NumPy as possible (though
>>>> there will be some changes necessary long-term).
>>>>
>>>> I don't think we should intentionally break ABI if we can avoid it, but
>>>> I also don't think we should spend inordinate amounts of time trying to
>>>> pretend that we won't break ABI (for at least some people), and most
>>>> importantly we should not pretend *not* to break the ABI when we actually
>>>> do.    We did this once before with the roll-out of date-time, and it was
>>>> really unnecessary.     When I released NumPy 1.0, there were several
>>>> things that I knew should be fixed very soon (NumPy was never designed to
>>>> not break ABI).    Those problems are still there.    Now that we have a
>>>> much better understanding of what NumPy *should* be (there have been
>>>> tremendous strides in understanding and community size over the past 10
>>>> years), let's actually make the infrastructure we think will last for the
>>>> next 20 years (instead of trying to shoe-horn new ideas into a 20-year old
>>>> code-base that wasn't designed for it).
>>>>
>>>> NumPy is a hard code-base.  It has been since Numeric days in 1995.
>>>> I could be wrong, but my guess is that we will be passed by as a community
>>>> if we don't seize the opportunity to build something better than we can
>>>> build if we are forced to use a 20 year old code-base.
>>>>
>>>> It is more important to not break people's code and to be clear when a
>>>> re-compile is necessary for dependencies.   Those to me are the most
>>>> important constraints. There are a lot of great ideas that we all have
>>>> about what we want NumPy to be able to do.     Some of this are pretty
>>>> transformational (and the more exciting they are, the harder I think they
>>>> are going to be to implement without breaking at least the ABI).     There
>>>> is probably some CAP-like theorem around
>>>> Stability-Features-Speed-of-Development (pick 2) when it comes to Open
>>>> Source Software development, and making feature-progress with NumPy *is
>>>> going* to create instability, which concerns me.
>>>>
>>>> I would like to see a little-bit-of-pain one time with a NumPy 2.0,
>>>> rather than the constant pain of the constant-churn-over-many-years
>>>> approach that Nathaniel seems to advocate.   To me, NumPy 2.0 is an
>>>> ABI-breaking release that is as API-compatible as possible and whose
>>>> semantics are not dramatically different.
>>>>
>>>> There are at least 3 areas of compatibility (ABI, API, and semantic).
>>>>  ABI-compatibility is a non-feature in today's world.   There are so many
>>>> distributions of the NumPy stack (and conda makes it trivial for anyone to
>>>> build their own or for you to build one yourself).   Making less-optimal
>>>> software-engineering choices because of fear of breaking the ABI is not
>>>> something I'm supportive of at all.   We should not break ABI every
>>>> release, but a release every 3 years that breaks ABI is not a problem.
>>>>
>>>> API compatibility should be much more sacrosanct, but it is also
>>>> something that can be managed.   Any NumPy 2.0 should definitely
>>>> support the full NumPy API (though there could be deprecated swaths).    I
>>>> think the community has done well in using deprecation and limiting the
>>>> public API to make this more manageable and I would love to see a NumPy 2.0
>>>> that solidifies a future-oriented API along with a back-ward compatible API
>>>> that is also available.
>>>>
>>>> Semantic compatibility is the hardest.   We have already broken this on
>>>> multiple occasions throughout the 1.x NumPy releases.  Every time you
>>>> change the code, this can change.    This is what I fear causing deep
>>>> instability over the course of many years.     These are things like the
>>>> casting rule details,  the effect of indexing changes, and any change to
>>>> the calculation approaches.     It is and has been the most at risk during any
>>>> code-changes.    My view is that a NumPy 2.0 (with a new low-level
>>>> architecture) minimizes these changes to a single release rather than
>>>> unavoidably spreading them out over many, many releases.
>>>>
>>>> I think that summarizes my main concerns.  I will write up more
>>>> forward-thinking ideas for what else is possible in the coming weeks.   In the mean
>>>> time, thanks for keeping the discussion going.  It is extremely exciting to
>>>> see the help people have continued to provide to maintain and improve
>>>> NumPy.    It will be exciting to see what the next few years bring as well.
>>>>
>>>>
>>>> Best,
>>>>
>>>> -Travis
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, Aug 25, 2015 at 5:03 AM, Nathaniel Smith <njs at pobox.com> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> These are the notes from the NumPy dev meeting held July 7, 2015, at
>>>>> the SciPy conference in Austin, presented here so the list can keep up
>>>>> with what happens, and so you can give feedback. Please do give
>>>>> feedback, none of this is final!
>>>>>
>>>>> (Also, if anyone who was there notices anything I left out or
>>>>> mischaracterized, please speak up -- these are a lot of notes I'm
>>>>> trying to gather together, so I could easily have missed something!)
>>>>>
>>>>> Thanks to Jill Cowan and the rest of the SciPy organizers for donating
>>>>> space and organizing logistics for us, and to the Berkeley Institute
>>>>> for Data Science for funding travel for Jaime, Nathaniel, and
>>>>> Sebastian.
>>>>>
>>>>>
>>>>> Attendees
>>>>> =========
>>>>>
>>>>>   Present in the room for all or part: Daniel Allan, Chris Barker,
>>>>>   Sebastian Berg, Thomas Caswell, Jeff Reback, Jaime Fernández del
>>>>>   Río, Chuck Harris, Nathaniel Smith, Stéfan van der Walt. (Note: I'm
>>>>>   pretty sure this list is incomplete)
>>>>>
>>>>>   Joining remotely for all or part: Stephan Hoyer, Julian Taylor.
>>>>>
>>>>>
>>>>> Formalizing our governance/decision making
>>>>> ==========================================
>>>>>
>>>>>   This was a major focus of discussion. At a high level, the consensus
>>>>>   was to steal IPython's governance document ("IPEP 29") and modify it
>>>>>   to remove its use of a BDFL as a "backstop" to normal community
>>>>>   consensus-based decision, and replace it with a new "backstop" based
>>>>>   on Apache-project-style consensus voting amongst the core team.
>>>>>
>>>>>   I'll send out a proper draft of this shortly for further discussion.
>>>>>
>>>>>
>>>>> Development roadmap
>>>>> ===================
>>>>>
>>>>>   General consensus:
>>>>>
>>>>>   Let's assume NumPy is going to remain important indefinitely, and
>>>>>   try to make it better, instead of waiting for something better to
>>>>>   come along. (This is unlikely to be wasted effort even if something
>>>>>   better does come along, and it's hardly a sure thing that that will
>>>>>   happen anyway.)
>>>>>
>>>>>   Let's focus on evolving numpy as far as we can without major
>>>>>   break-the-world changes (no "numpy 2.0", at least in the foreseeable
>>>>>   future).
>>>>>
>>>>>   And, as a target for that evolution, let's change our focus from
>>>>>   "NumPy is the library that gives you the np.ndarray object (plus
>>>>>   some attached infrastructure)" to "NumPy provides the standard
>>>>>   framework for working with arrays and array-like objects in
>>>>>   Python".
>>>>>
>>>>>   This means creating defined interfaces between array-like objects /
>>>>>   ufunc objects / dtype objects, so that it becomes possible for third
>>>>>   parties to add their own and mix-and-match. Right now ufuncs are
>>>>>   pretty good at this, but if you want a new array class or dtype then
>>>>>   in most cases you pretty much have to modify numpy itself.
>>>>>
>>>>>   Vision: instead of everyone who wants a new container type having to
>>>>>   reimplement all of numpy, Alice can implement an array class using
>>>>>   (sparse / distributed / compressed / tiled / gpu / out-of-core /
>>>>>   delayed / ...) storage, pass it to code that was written using
>>>>>   direct calls to np.* functions, and it just works. (Instead of
>>>>>   np.sin being "the way you calculate the sine of an ndarray", it's
>>>>>   "the way you calculate the sine of any array-like container
>>>>>   object".)
>>>>>
>>>>>   Vision: Darryl can implement a new dtype for (categorical data /
>>>>>   astronomical dates / integers-with-missing-values / ...) without
>>>>>   having to touch the numpy core.
>>>>>
>>>>>   Vision: Chandni can then come along and combine them by doing
>>>>>
>>>>>   a = alice_array([...], dtype=darryl_dtype)
>>>>>
>>>>>   and it just works.
>>>>>
>>>>>   Vision: no-one is tempted to subclass ndarray, because anything you
>>>>>   can do with an ndarray subclass you can also easily do by defining
>>>>>   your own new class that implements the "array protocol".
>>>>>
>>>>>
>>>>> Supporting third-party array types
>>>>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>>>>
>>>>>   Sub-goals:
>>>>>   - Get __numpy_ufunc__ done, which will cover a good chunk of numpy's
>>>>>     API right there.
>>>>>   - Go through the rest of the stuff in numpy, and figure out some
>>>>>     story for how to let it handle third-party array classes:
>>>>>     - ufunc ALL the things: Some things can be converted directly into
>>>>>       (g)ufuncs and then use __numpy_ufunc__ (e.g., np.std); some
>>>>>       things could be converted into (g)ufuncs if we extended the
>>>>>       (g)ufunc interface a bit (e.g. np.sort, np.matmul).
>>>>>     - Some things probably need their own __numpy_ufunc__-like
>>>>>       extensions (__numpy_concatenate__?)
>>>>>   - Provide tools to make it easier to implement the more complicated
>>>>>     parts of an array object (e.g. the bazillion different methods,
>>>>>     many of which are ufuncs in disguise, or indexing)
>>>>>   - Longer-run interesting research project: __numpy_ufunc__ requires
>>>>>     that one or the other object have explicit knowledge of how to
>>>>>     handle the other, so to handle binary ufuncs with N array types
>>>>>     you need something like N**2 __numpy_ufunc__ code paths. As an
>>>>>     alternative, if there were some interface that an object could
>>>>>     export that provided the operations nditer needs to efficiently
>>>>>     iterate over (chunks of) it, then you would only need N
>>>>>     implementations of this interface to handle all N**2 operations.
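>>>>>
>>>>>   As a rough illustration of the __numpy_ufunc__ hook mentioned
>>>>>   above, here is a minimal sketch of a non-ndarray container
>>>>>   implementing it (signature per the proposal as it stood at the
>>>>>   time; details were still in flux):
>>>>>
>>>>>     import numpy as np
>>>>>
>>>>>     class WrappedArray(object):
>>>>>         def __init__(self, data):
>>>>>             self.data = np.asarray(data)
>>>>>
>>>>>         def __numpy_ufunc__(self, ufunc, method, i, inputs, **kwargs):
>>>>>             # `i` is self's index within `inputs`. Unwrap our own
>>>>>             # containers, run the ufunc on raw ndarrays, re-wrap.
>>>>>             unwrapped = [x.data if isinstance(x, WrappedArray) else x
>>>>>                          for x in inputs]
>>>>>             return WrappedArray(getattr(ufunc, method)(*unwrapped,
>>>>>                                                        **kwargs))
>>>>>
>>>>>   With this in place, under the proposed protocol, np.add(
>>>>>   WrappedArray([1, 2]), 3) would dispatch to the container instead
>>>>>   of coercing it to a plain ndarray.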
>>>>>
>>>>>   This would solve a lot of problems for projects like:
>>>>>   - blosc
>>>>>   - dask
>>>>>   - distarray
>>>>>   - numpy.ma
>>>>>   - pandas
>>>>>   - scipy.sparse
>>>>>   - xray
>>>>>
>>>>>
>>>>> Supporting third-party dtypes
>>>>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>>>>
>>>>>   We already have something like a C level "dtype
>>>>>   protocol". Conceptually, the way you define a new dtype is by
>>>>>   defining a new class whose instances have data attributes defining
>>>>>   the parameters of the dtype (what fields are in *this* record dtype,
>>>>>   how many characters are in *this* string dtype, what units are used
>>>>>   for *this* datetime64, etc.), and you define a bunch of methods to
>>>>>   do things like convert an object from a Python object to your dtype
>>>>>   or vice-versa, to copy an array of your dtype from one place to
>>>>>   another, to cast to and from your new dtype, etc. This part is
>>>>>   great.
>>>>>
>>>>>   The problem is, in the current implementation, we don't actually use
>>>>>   the Python object system to define these classes / attributes /
>>>>>   methods. Instead, all possible dtypes are jammed into a single
>>>>>   Python-level class, whose struct has fields for the union of all
>>>>>   possible dtype's attributes, and instead of Python-style method
>>>>>   slots there's just a big table of function pointers attached to each
>>>>>   object.
>>>>>
>>>>>   So the main proposal is that we keep the basic design, but switch it
>>>>>   so that the float64 dtype, the int64 dtype, etc. actually literally
>>>>>   are subclasses of np.dtype, each implementing their own fields and
>>>>>   Python-style methods.
>>>>>
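>>>>>   Very roughly, the proposed shape of things, in purely hypothetical
>>>>>   Python (none of these names exist; this is only to make the idea
>>>>>   concrete):
>>>>>
>>>>>     class CategoricalDtype(np.dtype):
>>>>>         def __init__(self, categories):
>>>>>             # per-*instance* parameters, analogous to the fields of a
>>>>>             # record dtype or the unit of a datetime64
>>>>>             self.categories = tuple(categories)
>>>>>
>>>>>         def to_storage(self, obj):
>>>>>             # a Python-style method slot replacing one of today's
>>>>>             # PyArray_ArrFuncs function pointers
>>>>>             return self.categories.index(obj)
>>>>>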
>>>>>   Some of the pieces involved in doing this:
>>>>>
>>>>>   - The current dtype methods should be cleaned up -- e.g. 'dot' and
>>>>>     'less_than' are both dtype methods, when conceptually they're much
>>>>>     more like ufuncs.
>>>>>
>>>>>   - The ufunc inner-loop interface currently does not get a reference
>>>>>     to the dtype object, so they can't see its attributes and this is
>>>>>     a big obstacle to many interesting dtypes (e.g., it's hard to
>>>>>     implement np.equal for categoricals if you don't know what
>>>>>     categories each has). So we need to add new arguments to the core
>>>>>     ufunc loop signature. (Fortunately this can be done in a
>>>>>     backwards-compatible way.)
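>>>>>
>>>>>     For reference, the current inner-loop signature is:
>>>>>
>>>>>       void loop(char **args, npy_intp *dimensions, npy_intp *steps,
>>>>>                 void *innerloopdata);
>>>>>
>>>>>     Nothing in it identifies the operand dtypes. A hypothetical
>>>>>     extension (names invented here) would pass the descriptors too,
>>>>>     e.g. an extra `PyArray_Descr **descrs` argument, so a categorical
>>>>>     np.equal loop could consult each operand's category list.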
>>>>>
>>>>>   - We need to figure out what exactly the dtype methods should be,
>>>>>     and add them to the dtype class (possibly with backwards
>>>>>     compatibility shims for anyone who is accessing PyArray_ArrFuncs
>>>>>     directly).
>>>>>
>>>>>   - Casting will be possibly the trickiest thing to work out, though
>>>>>     the basic idea of using dunder-dispatch-like __cast__ and
>>>>>     __rcast__ methods seems workable. (Encouragingly, this is also
>>>>>     exactly what dynd does, though unfortunately dynd does not
>>>>>     yet support user-defined dtypes even to the extent that numpy
>>>>>     does, so there isn't much else we can steal from them.)
>>>>>     - We may also want to rethink the casting rules while we're at it,
>>>>>       since they have some very weird corners right now (e.g. see
>>>>>       [https://github.com/numpy/numpy/issues/6240])
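>>>>>
>>>>>     The dunder-dispatch idea, sketched with hypothetical names and
>>>>>     mirroring how __add__/__radd__ dispatch works:
>>>>>
>>>>>       def resolve_cast(src_dtype, dst_dtype):
>>>>>           impl = src_dtype.__cast__(dst_dtype)
>>>>>           if impl is NotImplemented:
>>>>>               impl = dst_dtype.__rcast__(src_dtype)
>>>>>           if impl is NotImplemented:
>>>>>               raise TypeError("no cast from %s to %s"
>>>>>                               % (src_dtype, dst_dtype))
>>>>>           return impl  # e.g. a function that converts one buffer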
>>>>>
>>>>>   - We need to migrate the current dtypes over to the new system,
>>>>>     which can be done in stages:
>>>>>
>>>>>     - First stick them all in a single "legacy dtype" class whose
>>>>>       methods just dispatch to the PyArray_ArrFuncs per-object "method
>>>>>       table"
>>>>>
>>>>>     - Then move each of them into their own classes
>>>>>
>>>>>   - We should provide a Python-level wrapper for the protocol, so that
>>>>>     you can call dtype methods from Python
>>>>>
>>>>>   - And vice-versa, it should be possible to subclass dtype at the
>>>>>     Python level
>>>>>
>>>>>   - etc.
>>>>>
>>>>>   Fortunately, AFAICT pretty much all of this can be done while
>>>>>   maintaining backwards compatibility (though we may want to break
>>>>>   some obscure cases to avoid expending *too* much effort with weird
>>>>>   backcompat contortions that will only help a vanishingly small
>>>>>   proportion of the userbase), and a lot of the above changes can be
>>>>>   done as semi-independent mini-projects, so there's no need for some
>>>>>   branch to go off and spend a year rewriting the world.
>>>>>
>>>>>   Obviously there are still a lot of details to work out, though. But
>>>>>   overall, there was widespread agreement that this is one of the #1
>>>>>   pain points for our users (e.g. it's the single main request from
>>>>>   pandas), and fixing it is very high priority.
>>>>>
>>>>>   Some features that would become straightforward to implement
>>>>>   (e.g. even in third-party libraries) if this were fixed:
>>>>>   - missing value support
>>>>>   - physical unit tracking (meters / seconds -> array of velocity;
>>>>>     meters + seconds -> error)
>>>>>   - better and more diverse datetime representations (e.g. datetimes
>>>>>     with attached timezones, or using funky geophysical or
>>>>>     astronomical calendars)
>>>>>   - categorical data
>>>>>   - variable length strings
>>>>>   - strings-with-encodings (e.g. latin1)
>>>>>   - forward mode automatic differentiation (write a function that
>>>>>     computes f(x) where x is an array of float64; pass that function
>>>>>     an array with a special dtype and get out both f(x) and f'(x))
>>>>>   - probably others I'm forgetting right now
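>>>>>
>>>>>   The forward-mode autodiff item rests on the dual-number trick; a
>>>>>   tiny scalar sketch of the idea, independent of any dtype machinery:
>>>>>
>>>>>     class Dual(object):
>>>>>         # carries a value and its derivative together; a dual dtype
>>>>>         # would do the same per array element
>>>>>         def __init__(self, val, der=1.0):
>>>>>             self.val, self.der = val, der
>>>>>         def __mul__(self, other):
>>>>>             return Dual(self.val * other.val,
>>>>>                         self.val * other.der + self.der * other.val)
>>>>>
>>>>>     y = (lambda x: x * x)(Dual(3.0))
>>>>>     print(y.val, y.der)   # 9.0 6.0, i.e. f(3) and f'(3)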
>>>>>
>>>>>   I should also note that there was one substantial objection to this
>>>>>   plan, from Travis Oliphant (in discussions later in the
>>>>>   conference). I'm not confident I understand his objections well
>>>>>   enough to reproduce them here, though -- perhaps he'll elaborate.
>>>>>
>>>>>
>>>>> Money
>>>>> =====
>>>>>
>>>>>   There was an extensive discussion on the topic of: "if we had money,
>>>>>   what would we do with it?"
>>>>>
>>>>>   This is partially motivated by the realization that there are a
>>>>>   number of sources that we could probably get money from, if we had a
>>>>>   good story for what we wanted to do, so it's not just an idle
>>>>>   question.
>>>>>
>>>>>   Points of general agreement:
>>>>>
>>>>>   - Doing the in-person meeting was a good thing. We should plan to do
>>>>>     that again, at least once a year. So one thing to spend money on
>>>>>     is travel subsidies to make sure that happens and is productive.
>>>>>
>>>>>   - While it's tempting to imagine hiring junior people for the more
>>>>>     frustrating/boring work like maintaining buildbots, release
>>>>>     infrastructure, updating docs, etc., this seems difficult to do
>>>>>     realistically with our current resources -- how do we hire for
>>>>>     this, who would manage them, etc.?
>>>>>
>>>>>   - On the other hand, the general feeling was that if we found the
>>>>>     money to hire a few more senior people who could take care of
>>>>>     themselves more, then that would be good and we could
>>>>>     realistically absorb that extra work without totally unbalancing
>>>>>     the project.
>>>>>
>>>>>     - A major open question is how we would recruit someone for a
>>>>>       position like this, since apparently all the obvious candidates
>>>>>       who are already active on the NumPy team already have other
>>>>>       things going on. [For calibration on how hard this can be: NYU
>>>>>       has apparently had an open position for a year with the job
>>>>>       description of "come work at NYU full-time with a
>>>>>       private-industry-competitive-salary on whatever your personal
>>>>>       open-source scientific project is" (!) and still is having an
>>>>>       extremely difficult time filling it:
>>>>>       [http://cds.nyu.edu/research-engineer/]]
>>>>>
>>>>>     - General consensus was that there isn't much to be done
>>>>>       about this though, except try it and see.
>>>>>
>>>>>     - (By the way, if you're someone who's reading this and
>>>>>       potentially interested in like a postdoc or better working on
>>>>>       numpy, then let's talk...)
>>>>>
>>>>>
>>>>> More specific changes to numpy that had general consensus, but don't
>>>>> really fit into a high-level roadmap
>>>>> ======================================================================
>>>>>
>>>>>   - Resolved: we should merge multiarray.so and umath.so into a single
>>>>>     extension module, so that they can share utility code without the
>>>>>     current awkward contortions.
>>>>>
>>>>>   - Resolved: we should start hiding new fields in the ufunc and dtype
>>>>>     structs as soon as possible going forward. (I.e. they would not be
>>>>>     present in the version of the structs that are exposed through the
>>>>>     C API, but internally we would use a more detailed struct.)
>>>>>     - Mayyyyyybe we should even go ahead and hide the subset of the
>>>>>       existing fields that are really internal details that no-one
>>>>>       should be using. If we did this without changing anything else
>>>>>       then it would preserve ABI (the fields would still be where
>>>>>       existing compiled extensions expect them to be, if any such
>>>>>       extensions exist) while breaking API (trying to compile such
>>>>>       extensions would give a clear error), so would be a smoother
>>>>>       ramp if we think we need to eventually break those fields for
>>>>>       real. (As discussed above, there are a bunch of fields in the
>>>>>       dtype base class that only make sense for specific dtype
>>>>>       subclasses, e.g. only record dtypes need a list of field names,
>>>>>       but right now all dtypes have one anyway. So it would be nice to
>>>>>       remove these from the base class entirely, but that is
>>>>>       potentially ABI-breaking.)
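>>>>>
>>>>>     Schematically, the field-hiding trick looks like this (simplified;
>>>>>     the internal struct name is invented here):
>>>>>
>>>>>       /* public header -- what extensions compile against */
>>>>>       typedef struct { /* ...public fields only... */ } PyArray_Descr;
>>>>>
>>>>>       /* internal header -- same object, extra trailing fields */
>>>>>       typedef struct {
>>>>>           PyArray_Descr base;       /* layout-compatible prefix */
>>>>>           void *new_private_field;  /* invisible to the public ABI */
>>>>>       } PyArray_DescrInternal;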
>>>>>
>>>>>   - Resolved: np.array should never return an object array unless
>>>>>     explicitly requested (e.g. with dtype=object); it just causes too
>>>>>     many surprising problems.
>>>>>     - First step: add a deprecation warning
>>>>>     - Eventually: make it an error.
>>>>>
>>>>>   - The matrix class
>>>>>     - Resolved: We won't add warnings yet, but we will prominently
>>>>>       document that it is deprecated and should be avoided where-ever
>>>>>       possible.
>>>>>       - Stéfan van der Walt volunteers to do this.
>>>>>     - We'd all like to deprecate it properly, but the feeling was that
>>>>>       the precondition for this is for scipy.sparse to provide sparse
>>>>>       "arrays" that don't return np.matrix objects on ordinary
>>>>>       operations. Until that happens we can't reasonably tell people
>>>>>       that using np.matrix is a bug.
>>>>>
>>>>>   - Resolved: we should add a similar prominent note to the
>>>>>     "subclassing ndarray" documentation, warning people that this is
>>>>>     painful and barely works and please don't do it if you have any
>>>>>     alternatives.
>>>>>
>>>>>   - Resolved: we want more, smaller releases -- every 6 months at
>>>>>     least, aiming to go even faster (every 4 months?)
>>>>>
>>>>>   - On the question of using Cython inside numpy core:
>>>>>     - Everyone agrees that there are places where this would be an
>>>>>       improvement (e.g., Python<->C interfaces, and places "when you
>>>>>       want to do computer science", e.g. complicated algorithmic stuff
>>>>>       like graph traversals)
>>>>>     - Chuck wanted it to be clear though that he doesn't think it
>>>>>       would be a good goal to try and rewrite all of numpy in Cython
>>>>>       -- there also exist places where Cython ends up being "an uglier
>>>>>       version of C". No-one disagreed.
>>>>>
>>>>>   - Our text reader is apparently not very functional on Python 3, and
>>>>>     generally slow and hard to work with.
>>>>>     - Resolved: We should extract Pandas's awesome text reader/parser
>>>>>       and convert it into its own package, that could then become a
>>>>>       new backend for both pandas and numpy.loadtxt.
>>>>>     - Jeff thinks this is a great idea
>>>>>     - Thomas Caswell volunteers to do the extraction.
>>>>>
>>>>>   - We should work on improving our tools for evolving the ABI, so
>>>>>     that we will eventually be less constrained by decisions made
>>>>>     decades ago.
>>>>>     - One idea that had a lot of support was to switch from our
>>>>>       current append-only C-API to a "sliding window" API based on
>>>>>       explicit versions. So a downstream package might say
>>>>>
>>>>>       #define NUMPY_API_VERSION 4
>>>>>
>>>>>       and they'd get the functions and behaviour provided in "version
>>>>>       4" of the numpy C api. If they wanted to get access to new stuff
>>>>>       that was added in version 5, then they'd need to switch that
>>>>>       #define, and at the same time clean up any usage of stuff that
>>>>>       was removed or changed in version 5. And to provide a smooth
>>>>>       migration path, one version of numpy would support multiple
>>>>>       versions at once, gradually deprecating and dropping old
>>>>>       versions.
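>>>>>
>>>>>       Header-side, the versioning could look something like this
>>>>>       (purely schematic; nothing like it exists yet):
>>>>>
>>>>>       #if NUMPY_API_VERSION < 5
>>>>>         /* keep the version-4 spelling alive for older packages */
>>>>>         #define PyArray_OldName PyArray_NewName
>>>>>       #else
>>>>>         /* only version-5 users see the new declarations */
>>>>>         int PyArray_SomeNewFunction(void);
>>>>>       #endif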
>>>>>
>>>>>     - If anyone wants to help bring pip up to scratch WRT tracking ABI
>>>>>       dependencies (e.g., 'pip install numpy==<version with new ABI>'
>>>>>       -> triggers rebuild of scipy against the new ABI), then that
>>>>>       would be an extremely useful thing.
>>>>>
>>>>>
>>>>> Policies that should be documented
>>>>> ==================================
>>>>>
>>>>>   ...together with some notes about what the contents of the document
>>>>>   should be:
>>>>>
>>>>>
>>>>> How we manage bugs in the bug tracker.
>>>>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>>>>
>>>>>   - Github "milestones" should *only* be assigned to release-blocker
>>>>>     bugs (which mostly means "regression from the last release").
>>>>>
>>>>>     In particular, if you're tempted to push a bug forward to the next
>>>>>     release... then it's clearly not a blocker, so don't set it to the
>>>>>     next release's milestone, just remove the milestone entirely.
>>>>>
>>>>>     (Obvious exception to this: deprecation followup bugs where we
>>>>>     decide that we want to keep the deprecation around a bit longer
>>>>>     are a case where a bug actually does switch from being a blocker
>>>>>     for release 1.x to being a blocker for release 1.(x+1).)
>>>>>
>>>>>   - Don't hesitate to close an issue if there's no way forward --
>>>>>     e.g. a PR where the author has disappeared. Just post a link to
>>>>>     this policy and close, with a polite note that we need to keep our
>>>>>     tracker useful as a todo list, but they're welcome to re-open if
>>>>>     things change.
>>>>>
>>>>>
>>>>> Deprecations and breakage policy:
>>>>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>>>>
>>>>>   - How long do we need to keep DeprecationWarnings around before we
>>>>>     break things? This is tricky because on the one hand an aggressive
>>>>>     (short) deprecation period lets us deliver new features and
>>>>>     important cleanups more quickly, but on the other hand a
>>>>>     too-aggressive deprecation period is difficult for our more
>>>>>     conservative downstream users.
>>>>>
>>>>>     - Idea that had the most support: pick a somewhat-aggressive
>>>>>       warning period as our default, and make a rule that if someone
>>>>>       asks for an extension during the beta cycle for the release that
>>>>>       removes it, then we put it back for another release or two worth
>>>>>       of grace period. (While also possibly upgrading the warning to
>>>>>       be more visible during the grace period.) This gives us
>>>>>       deprecation periods that are more adaptive on a case-by-case
>>>>>       basis.
>>>>>
>>>>>   - Lament: it would be really nice if we could get more people to
>>>>>     test our beta releases, because in practice right now 1.x.0 ends
>>>>>     up being where we actually discover all the bugs, and 1.x.1 is
>>>>>     where it actually becomes usable. Which sucks, and makes it
>>>>>     difficult to have a solid policy about what counts as a
>>>>>     regression, etc. Is there anything we can do about this?
>>>>>
>>>>>   - ABI breakage: we distinguish between an ABI break that breaks
>>>>>     everything (e.g., "import scipy" segfaults), versus an ABI break
>>>>>     that breaks an occasional rare case (e.g., only apps that poke
>>>>>     around in some obscure corner of some struct are affected).
>>>>>
>>>>>     - The "break-the-world" type remains off-limit for now: the pain
>>>>>       is still too large (conda helps, but there are lots of people
>>>>>       who don't use conda!), and there aren't really any compelling
>>>>>       improvements that this would enable anyway.
>>>>>
>>>>>     - For the "break-0.1%-of-users" type, it is *not* ruled out by
>>>>>       fiat, though we remain conservative: we should treat it like
>>>>>       other API breaks in principle, and do a careful case-by-case
>>>>>       analysis of the details of the situation, taking into account
>>>>>       what kind of code would be broken, how common these cases are,
>>>>>       how important the benefits are, whether there are any specific
>>>>>       mitigation strategies we can use, etc. -- with this process of
>>>>>       course taking into account that a segfault is nastier than a
>>>>>       Python exception.
>>>>>
>>>>>
>>>>> Other points that were discussed
>>>>> ================================
>>>>>
>>>>>   - There was inconclusive discussion of what we should do with dot()
>>>>>     in the places where it disagrees with the PEP 465 matmul semantics
>>>>>     (specifically this is when both arguments have ndim >= 3, or one
>>>>>     argument has ndim == 0).
>>>>>     - The concern is that the current behavior is not very useful, and
>>>>>       as far as we can tell no-one is using it; but, as people get
>>>>>       used to the more-useful PEP 465 behavior, they will increasingly
>>>>>       try to use it on the assumption that np.dot will work the same
>>>>>       way, and this will create pain for lots of people. So Nathaniel
>>>>>       argued that we should start at least issuing a visible warning
>>>>>       when people invoke the corner-case behavior.
>>>>>     - But OTOH, np.dot is such a core piece of infrastructure, and
>>>>>       there's such a large landscape of code out there using numpy
>>>>>       that we can't see, that others were reasonably wary of making
>>>>>       any change.
>>>>>     - For now: document prominently, but no change in behavior.
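>>>>>
>>>>>     For concreteness, the ndim >= 3 disagreement (runnable once numpy
>>>>>     has np.matmul):
>>>>>
>>>>>       import numpy as np
>>>>>       a = np.ones((2, 3, 4))
>>>>>       b = np.ones((2, 4, 5))
>>>>>       np.dot(a, b).shape     # (2, 3, 2, 5): dot pairs all stacks
>>>>>       np.matmul(a, b).shape  # (2, 3, 5): PEP 465 broadcasts the stack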
>>>>>
>>>>>
>>>>> Links to raw notes
>>>>> ==================
>>>>>
>>>>>   Main page:
>>>>>   [https://github.com/numpy/numpy/wiki/SciPy-2015-developer-meeting]
>>>>>
>>>>>   Notes from the meeting proper:
>>>>>   [
>>>>> https://docs.google.com/document/d/1IJcYdsHtk8MVAM4AZqFDBSf_nVG-mrB4Tv2bh9u1g4Y/edit?usp=sharing
>>>>> ]
>>>>>
>>>>>   Slides from the followup BoF:
>>>>>   [
>>>>> https://gist.github.com/njsmith/eb42762054c88e810786/raw/b74f978ce10a972831c582485c80fb5b8e68183b/future-of-numpy-bof.odp
>>>>> ]
>>>>>
>>>>>   Notes from the followup BoF:
>>>>>   [
>>>>> https://docs.google.com/document/d/11AuTPms5dIPo04JaBOWEoebXfk-tUzEZ-CvFnLIt33w/edit
>>>>> ]
>>>>>
>>>>> -n
>>>>>
>>>>> --
>>>>> Nathaniel J. Smith -- http://vorpus.org
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>>
>>>> *Travis Oliphant*
>>>> *Co-founder and CEO*
>>>>
>>>>
>>>> @teoliphant
>>>> 512-222-5440
>>>> http://www.continuum.io
>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Nathaniel J. Smith -- http://vorpus.org
>>>
>>>
>>>
>>
>>
>> --
>>
>> *Travis Oliphant*
>> *Co-founder and CEO*
>>
>>
>> @teoliphant
>> 512-222-5440
>> http://www.continuum.io
>>
>>
>>
>
>
>