
On Tue, Aug 25, 2015 at 4:03 AM, Nathaniel Smith <njs@pobox.com> wrote:
Hi all,
These are the notes from the NumPy dev meeting held July 7, 2015, at the SciPy conference in Austin, presented here so the list can keep up with what happens, and so you can give feedback. Please do give feedback, none of this is final!
(Also, if anyone who was there notices anything I left out or mischaracterized, please speak up -- these are a lot of notes I'm trying to gather together, so I could easily have missed something!)
Thanks to Jill Cowan and the rest of the SciPy organizers for donating space and organizing logistics for us, and to the Berkeley Institute for Data Science for funding travel for Jaime, Nathaniel, and Sebastian.
Attendees
=========
Present in the room for all or part: Daniel Allan, Chris Barker, Sebastian Berg, Thomas Caswell, Jeff Reback, Jaime Fernández del Río, Chuck Harris, Nathaniel Smith, Stéfan van der Walt. (Note: I'm pretty sure this list is incomplete)
Joining remotely for all or part: Stephan Hoyer, Julian Taylor.
Formalizing our governance/decision making
==========================================
This was a major focus of discussion. At a high level, the consensus was to steal IPython's governance document ("IPEP 29") and modify it to remove its use of a BDFL as a "backstop" to normal community consensus-based decision-making, and replace it with a new "backstop" based on Apache-project-style consensus voting amongst the core team.
I'll send out a proper draft of this shortly for further discussion.
Development roadmap
===================
General consensus:
Let's assume NumPy is going to remain important indefinitely, and try to make it better, instead of waiting for something better to come along. (This is unlikely to be wasted effort even if something better does come along, and it's hardly a sure thing that that will happen anyway.)
Let's focus on evolving numpy as far as we can without major break-the-world changes (no "numpy 2.0", at least in the foreseeable future).
And, as a target for that evolution, let's change our focus from "NumPy is the library that gives you the np.ndarray object (plus some attached infrastructure)" to "NumPy provides the standard framework for working with arrays and array-like objects in Python".
This means creating well-defined interfaces between array-like objects / ufunc objects / dtype objects, so that it becomes possible for third parties to add their own and mix-and-match. Right now ufuncs are pretty good at this, but if you want a new array class or dtype then in most cases you pretty much have to modify numpy itself.
Vision: instead of everyone who wants a new container type having to reimplement all of numpy, Alice can implement an array class using (sparse / distributed / compressed / tiled / gpu / out-of-core / delayed / ...) storage, pass it to code that was written using direct calls to np.* functions, and it just works. (Instead of np.sin being "the way you calculate the sine of an ndarray", it's "the way you calculate the sine of any array-like container object".)
Vision: Darryl can implement a new dtype for (categorical data / astronomical dates / integers-with-missing-values / ...) without having to touch the numpy core.
Vision: Chandni can then come along and combine them by doing
a = alice_array([...], dtype=darryl_dtype)
and it just works.
Vision: no-one is tempted to subclass ndarray, because anything you can do with an ndarray subclass you can also easily do by defining your own new class that implements the "array protocol".
Supporting third-party array types
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Sub-goals:
- Get __numpy_ufunc__ done, which will cover a good chunk of numpy's API right there.
- Go through the rest of the stuff in numpy, and figure out some story for how to let it handle third-party array classes:
  - ufunc ALL the things: Some things can be converted directly into (g)ufuncs and then use __numpy_ufunc__ (e.g., np.std); some things could be converted into (g)ufuncs if we extended the (g)ufunc interface a bit (e.g. np.sort, np.matmul).
  - Some things probably need their own __numpy_ufunc__-like extensions (__numpy_concatenate__?)
- Provide tools to make it easier to implement the more complicated parts of an array object (e.g. the bazillion different methods, many of which are ufuncs in disguise, or indexing)
- Longer-run interesting research project: __numpy_ufunc__ requires that one or the other object have explicit knowledge of how to handle the other, so to handle binary ufuncs with N array types you need something like N**2 __numpy_ufunc__ code paths. As an alternative, if there were some interface that an object could export that provided the operations nditer needs to efficiently iterate over (chunks of) it, then you would only need N implementations of this interface to handle all N**2 operations.
This would solve a lot of problems for projects like:
- blosc
- dask
- distarray
- numpy.ma
- pandas
- scipy.sparse
- xray
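To make the __numpy_ufunc__ idea concrete, here is a minimal sketch. It assumes a recent NumPy, where this hook is spelled __array_ufunc__ (the name the protocol eventually shipped under) rather than the __numpy_ufunc__ spelling used in these notes. WrappedArray is a hypothetical container invented for illustration; it is not an ndarray subclass, yet calls like np.sin are routed to it.

```python
import numpy as np


class WrappedArray:
    """Hypothetical array-like container; NOT a subclass of np.ndarray."""

    def __init__(self, data):
        self.data = np.asarray(data)

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        # This sketch only handles plain calls such as np.sin(x); reductions
        # and other ufunc methods are declined. It also does not unwrap out=.
        if method != "__call__":
            return NotImplemented
        # Unwrap any WrappedArray inputs, run the ufunc on the raw ndarrays,
        # and wrap the result again.
        raw = [x.data if isinstance(x, WrappedArray) else x for x in inputs]
        return WrappedArray(ufunc(*raw, **kwargs))


a = WrappedArray([0.0, 1.0])
b = np.sin(a)      # dispatches to WrappedArray.__array_ufunc__
c = np.add(a, 1)   # binary ufuncs go through the same hook
```

A real container would do something more interesting with its storage than wrapping an ndarray, but the dispatch path would look the same.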
Supporting third-party dtypes ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
We already have something like a C level "dtype protocol". Conceptually, the way you define a new dtype is by defining a new class whose instances have data attributes defining the parameters of the dtype (what fields are in *this* record dtype, how many characters are in *this* string dtype, what units are used for *this* datetime64, etc.), and you define a bunch of methods to do things like convert an object from a Python object to your dtype or vice-versa, to copy an array of your dtype from one place to another, to cast to and from your new dtype, etc. This part is great.
The problem is, in the current implementation, we don't actually use the Python object system to define these classes / attributes / methods. Instead, all possible dtypes are jammed into a single Python-level class, whose struct has fields for the union of all possible dtypes' attributes, and instead of Python-style method slots there's just a big table of function pointers attached to each object.
So the main proposal is that we keep the basic design, but switch it so that the float64 dtype, the int64 dtype, etc. actually literally are subclasses of np.dtype, each implementing their own fields and Python-style methods.
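As a rough illustration of the shape of this proposal (not of numpy's actual classes), here is a pure-Python sketch. DType, Float64, FixedString and Datetime64 are hypothetical names made up for the example; the point is only that each subclass carries just the attributes that make sense for it, instead of one class holding the union of all fields.

```python
class DType:
    """Hypothetical stand-in for np.dtype in this sketch."""
    itemsize = None


class Float64(DType):
    # No parameters: every float64 dtype instance looks the same.
    itemsize = 8


class FixedString(DType):
    # Parametric: each instance records how many characters it holds.
    def __init__(self, length):
        self.length = length
        self.itemsize = length  # assuming one byte per character


class Datetime64(DType):
    # Parametric: each instance records its own unit.
    itemsize = 8

    def __init__(self, unit):
        self.unit = unit


s5 = FixedString(5)
ns = Datetime64("ns")
f8 = Float64()
```

Here f8 has no "unit" or "length" attribute at all, which is the property the single jammed-together class cannot offer.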
Some of the pieces involved in doing this:
- The current dtype methods should be cleaned up -- e.g. 'dot' and 'less_than' are both dtype methods, when conceptually they're much more like ufuncs.
- The ufunc inner-loop interface currently does not get a reference to the dtype object, so the inner loops can't see its attributes, and this is a big obstacle to many interesting dtypes (e.g., it's hard to implement np.equal for categoricals if you don't know what categories each has). So we need to add new arguments to the core ufunc loop signature. (Fortunately this can be done in a backwards-compatible way.)
- We need to figure out what exactly the dtype methods should be, and add them to the dtype class (possibly with backwards compatibility shims for anyone who is accessing PyArray_ArrFuncs directly).
- Casting will be possibly the trickiest thing to work out, though the basic idea of using dunder-dispatch-like __cast__ and __rcast__ methods seems workable. (Encouragingly, this is exactly what dynd does too, though unfortunately dynd does not yet support user-defined dtypes even to the extent that numpy does, so there isn't much else we can steal from them.)
  - We may also want to rethink the casting rules while we're at it, since they have some very weird corners right now (e.g. see [https://github.com/numpy/numpy/issues/6240])
- We need to migrate the current dtypes over to the new system, which can be done in stages:
  - First stick them all in a single "legacy dtype" class whose methods just dispatch to the PyArray_ArrFuncs per-object "method table"
  - Then move each of them into their own classes
- We should provide a Python-level wrapper for the protocol, so that you can call dtype methods from Python
- And vice-versa, it should be possible to subclass dtype at the Python level
- etc.
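The dunder-dispatch casting idea from the list above can be sketched in pure Python. Everything here is an assumption made for illustration: the notes only name __cast__ and __rcast__, so the signatures, the cast() helper, and the Str / Float64 / Categorical classes are all hypothetical. The dispatch order mirrors how Python handles __add__ / __radd__: ask the source dtype first, then give the target dtype a chance.

```python
class DType:
    """Hypothetical dtype base class for this sketch."""

    def __cast__(self, value, target):
        # "I am the source dtype; can I convert value to target?"
        return NotImplemented

    def __rcast__(self, value, source):
        # "I am the target dtype; can I convert value from source?"
        return NotImplemented


def cast(value, source, target):
    """Try source.__cast__ first, then fall back to target.__rcast__."""
    result = source.__cast__(value, target)
    if result is NotImplemented:
        result = target.__rcast__(value, source)
    if result is NotImplemented:
        raise TypeError("no cast from %s to %s"
                        % (type(source).__name__, type(target).__name__))
    return result


class Str(DType):
    pass  # knows nothing about categoricals


class Float64(DType):
    pass  # knows nothing about categoricals either


class Categorical(DType):
    def __init__(self, categories):
        self.categories = tuple(categories)

    def __cast__(self, value, target):
        # Stored values are integer codes; casting to Str looks up the label.
        if isinstance(target, Str):
            return self.categories[value]
        return NotImplemented

    def __rcast__(self, value, source):
        # Casting from Str finds the code for a label.
        if isinstance(source, Str):
            return self.categories.index(value)
        return NotImplemented


colors = Categorical(["red", "green", "blue"])
label = cast(1, colors, Str())        # handled by Categorical.__cast__
code = cast("blue", Str(), colors)    # handled by Categorical.__rcast__
```

Note that Str never had to learn about Categorical: the new dtype supplies both directions itself, which is what lets third parties add dtypes without touching the core.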
Fortunately, AFAICT pretty much all of this can be done while maintaining backwards compatibility (though we may want to break some obscure cases to avoid expending *too* much effort with weird backcompat contortions that will only help a vanishingly small proportion of the userbase), and a lot of the above changes can be done as semi-independent mini-projects, so there's no need for some branch to go off and spend a year rewriting the world.
Obviously there are still a lot of details to work out, though. But overall, there was widespread agreement that this is one of the #1 pain points for our users (e.g. it's the single main request from pandas), and fixing it is very high priority.
Some features that would become straightforward to implement (e.g. even in third-party libraries) if this were fixed:
- missing value support
- physical unit tracking (meters / seconds -> array of velocity; meters + seconds -> error)
- better and more diverse datetime representations (e.g. datetimes with attached timezones, or using funky geophysical or astronomical calendars)
- categorical data
- variable length strings
- strings-with-encodings (e.g. latin1)
- forward mode automatic differentiation (write a function that computes f(x) where x is an array of float64; pass that function an array with a special dtype and get out both f(x) and f'(x))
- probably others I'm forgetting right now
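For readers who haven't seen forward mode automatic differentiation, here is a small pure-Python sketch of the idea using dual numbers on scalars. The Dual class is hypothetical and has nothing to do with numpy's dtype machinery; a dtype-based version would do the same arithmetic elementwise inside the ufunc loops.

```python
class Dual:
    """Hypothetical dual number: a value plus its derivative."""

    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    @staticmethod
    def _lift(other):
        # Plain numbers are constants, so their derivative is zero.
        return other if isinstance(other, Dual) else Dual(other)

    def __add__(self, other):
        other = Dual._lift(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __mul__(self, other):
        other = Dual._lift(other)
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

    __rmul__ = __mul__


def f(x):
    # Written with ordinary arithmetic; it knows nothing about Dual.
    return 3 * x * x + 2 * x + 1


y = f(Dual(2.0, 1.0))  # seed derivative dx/dx = 1
# y.value is f(2.0) and y.deriv is f'(2.0)
```

The same f works unchanged on a plain float, which is the "pass that function an array with a special dtype" property from the list above.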
I should also note that there was one substantial objection to this plan, from Travis Oliphant (in discussions later in the conference). I'm not confident I understand his objections well enough to reproduce them here, though -- perhaps he'll elaborate.
Money
=====
There was an extensive discussion on the topic of: "if we had money, what would we do with it?"
This is partially motivated by the realization that there are a number of sources that we could probably get money from, if we had a good story for what we wanted to do, so it's not just an idle question.
Points of general agreement:
- Doing the in-person meeting was a good thing. We should plan to do that again, at least once a year. So one thing to spend money on is travel subsidies to make sure that happens and is productive.
- While it's tempting to imagine hiring junior people for the more frustrating/boring work like maintaining buildbots, release infrastructure, updating docs, etc., this seems difficult to do realistically with our current resources -- how do we hire for this, who would manage them, etc.?
- On the other hand, the general feeling was that if we found the money to hire a few more senior people who could take care of themselves more, then that would be good and we could realistically absorb that extra work without totally unbalancing the project.
- A major open question is how we would recruit someone for a position like this, since apparently all the obvious candidates who are already active on the NumPy team already have other things going on. [For calibration on how hard this can be: NYU has apparently had an open position for a year with the job description of "come work at NYU full-time with a private-industry-competitive-salary on whatever your personal open-source scientific project is" (!) and still is having an extremely difficult time filling it: [http://cds.nyu.edu/research-engineer/]]
- General consensus was that there isn't much to be done about this, though, except try it and see.
- (By the way, if you're someone who's reading this and potentially interested in like a postdoc or better working on numpy, then let's talk...)
More specific changes to numpy that had general consensus, but don't really fit into a high-level roadmap
=========================================================================================================
- Resolved: we should merge multiarray.so and umath.so into a single extension module, so that they can share utility code without the current awkward contortions.
- Resolved: we should start hiding new fields in the ufunc and dtype structs as soon as possible going forward. (I.e. they would not be present in the version of the structs that are exposed through the C API, but internally we would use a more detailed struct.)
  - Mayyyyyybe we should even go ahead and hide the subset of the existing fields that are really internal details that no-one should be using. If we did this without changing anything else then it would preserve ABI (the fields would still be where existing compiled extensions expect them to be, if any such extensions exist) while breaking API (trying to compile such extensions would give a clear error), so would be a smoother ramp if we think we need to eventually break those fields for real. (As discussed above, there are a bunch of fields in the dtype base class that only make sense for specific dtype subclasses, e.g. only record dtypes need a list of field names, but right now all dtypes have one anyway. So it would be nice to remove these from the base class entirely, but that is potentially ABI-breaking.)
- Resolved: np.array should never return an object array unless explicitly requested (e.g. with dtype=object); it just causes too many surprising problems.
  - First step: add a deprecation warning
  - Eventually: make it an error.
- The matrix class
  - Resolved: We won't add warnings yet, but we will prominently document that it is deprecated and should be avoided wherever possible.
    - Stéfan van der Walt volunteers to do this.
  - We'd all like to deprecate it properly, but the feeling was that the precondition for this is for scipy.sparse to provide sparse "arrays" that don't return np.matrix objects on ordinary operations. Until that happens we can't reasonably tell people that using np.matrix is a bug.
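To illustrate the object-array resolution, here is a small sketch. strict_array is a hypothetical wrapper written for this example, not a numpy function; it shows the "error unless explicitly requested" end state on top of whatever np.array currently does.

```python
import numpy as np


def strict_array(obj, dtype=None):
    """Hypothetical wrapper: refuse to *infer* an object dtype."""
    arr = np.array(obj, dtype=dtype)
    if dtype is None and arr.dtype == object:
        raise TypeError("object array not explicitly requested; "
                        "pass dtype=object if you really want one")
    return arr


ints = strict_array([1, 2, 3])                  # fine: numeric dtype inferred
objs = strict_array([1, None], dtype=object)    # fine: explicitly requested

try:
    strict_array([1, None])                     # np.array would infer object here
    rejected = False
except TypeError:
    rejected = True
```

The surprising cases tend to be inputs like [1, None], where a single stray value silently turns a numeric array into an array of Python objects.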
- Resolved: we should add a similar prominent note to the "subclassing ndarray" documentation, warning people that this is painful and barely works and please don't do it if you have any alternatives.
- Resolved: we want more, smaller releases -- every 6 months at least, aiming to go even faster (every 4 months?)
- On the question of using Cython inside numpy core:
  - Everyone agrees that there are places where this would be an improvement (e.g., Python<->C interfaces, and places "when you want to do computer science", e.g. complicated algorithmic stuff like graph traversals)
  - Chuck wanted it to be clear though that he doesn't think it would be a good goal to try and rewrite all of numpy in Cython -- there also exist places where Cython ends up being "an uglier version of C". No-one disagreed.
- Our text reader is apparently not very functional on Python 3, and generally slow and hard to work with.
  - Resolved: We should extract Pandas's awesome text reader/parser and convert it into its own package, that could then become a new backend for both pandas and numpy.loadtxt.
  - Jeff thinks this is a great idea
  - Thomas Caswell volunteers to do the extraction.
- We should work on improving our tools for evolving the ABI, so that we will eventually be less constrained by decisions made decades ago.
  - One idea that had a lot of support was to switch from our current append-only C-API to a "sliding window" API based on explicit versions. So a downstream package might say
#define NUMPY_API_VERSION 4
and they'd get the functions and behaviour provided in "version 4" of the numpy C api. If they wanted to get access to new stuff that was added in version 5, then they'd need to switch that #define, and at the same time clean up any usage of stuff that was removed or changed in version 5. And to provide a smooth migration path, one version of numpy would support multiple versions at once, gradually deprecating and dropping old versions.
- If anyone wants to help bring pip up to scratch WRT tracking ABI dependencies (e.g., 'pip install numpy==<version with new ABI>' -> triggers rebuild of scipy against the new ABI), then that would be an extremely useful thing.
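The "sliding window" API idea above can be modelled in a few lines of Python. This is only a toy: the real mechanism would live in C headers behind the NUMPY_API_VERSION define, and the function names, version numbers and the api_for helper below are all made up for illustration.

```python
# Hypothetical window: this numpy release supports API versions 3, 4 and 5.
SUPPORTED_VERSIONS = range(3, 6)

# name -> (version where it was added, version where it was removed or None)
API_TABLE = {
    "old_func": (1, 5),        # removed in version 5
    "stable_func": (1, None),  # never removed
    "new_func": (5, None),     # added in version 5
}


def api_for(version):
    """Return the set of names visible to code requesting `version`."""
    if version not in SUPPORTED_VERSIONS:
        raise ValueError("API version %d is outside the supported window"
                         % version)
    return {
        name
        for name, (added, removed) in API_TABLE.items()
        if added <= version and (removed is None or version < removed)
    }


v4 = api_for(4)  # what a package built against "version 4" would see
v5 = api_for(5)  # switching to 5 gains new_func and loses old_func
```

Dropping a version from the window is then an explicit, visible step instead of a silent break.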
Policies that should be documented
==================================
...together with some notes about what the contents of the document should be:
How we manage bugs in the bug tracker. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- Github "milestones" should *only* be assigned to release-blocker bugs (which mostly means "regression from the last release").
In particular, if you're tempted to push a bug forward to the next release... then it's clearly not a blocker, so don't set it to the next release's milestone, just remove the milestone entirely.
(Obvious exception to this: deprecation followup bugs where we decide that we want to keep the deprecation around a bit longer are a case where a bug actually does switch from being a blocker for release 1.x to being a blocker for release 1.(x+1).)
- Don't hesitate to close an issue if there's no way forward -- e.g. a PR where the author has disappeared. Just post a link to this policy and close, with a polite note that we need to keep our tracker useful as a todo list, but they're welcome to re-open if things change.
Deprecations and breakage policy: ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- How long do we need to keep DeprecationWarnings around before we break things? This is tricky because on the one hand an aggressive (short) deprecation period lets us deliver new features and important cleanups more quickly, but on the other hand a too-aggressive deprecation period is difficult for our more conservative downstream users.
- Idea that had the most support: pick a somewhat-aggressive warning period as our default, and make a rule that if someone asks for an extension during the beta cycle for the release that removes it, then we put it back for another release or two worth of grace period. (While also possibly upgrading the warning to be more visible during the grace period.) This gives us deprecation periods that are more adaptive on a case-by-case basis.
- Lament: it would be really nice if we could get more people to test our beta releases, because in practice right now 1.x.0 ends up being where we actually discover all the bugs, and 1.x.1 is where it actually becomes usable. Which sucks, and makes it difficult to have a solid policy about what counts as a regression, etc. Is there anything we can do about this?
- ABI breakage: we distinguish between an ABI break that breaks everything (e.g., "import scipy" segfaults), versus an ABI break that breaks an occasional rare case (e.g., only apps that poke around in some obscure corner of some struct are affected).
- The "break-the-world" type remains off-limits for now: the pain is still too large (conda helps, but there are lots of people who don't use conda!), and there aren't really any compelling improvements that this would enable anyway.
- For the "break-0.1%-of-users" type, it is *not* ruled out by fiat, though we remain conservative: we should treat it like other API breaks in principle, and do a careful case-by-case analysis of the details of the situation, taking into account what kind of code would be broken, how common these cases are, how important the benefits are, whether there are any specific mitigation strategies we can use, etc. -- with this process of course taking into account that a segfault is nastier than a Python exception.
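On the deprecation-period bullets above, here is a minimal sketch of what "upgrading the warning to be more visible during the grace period" could look like using only the standard warnings module. The deprecated decorator and the old_sum / old_mean functions are hypothetical. It relies on the fact that Python hides DeprecationWarning by default outside of __main__, while FutureWarning is shown to end users by default.

```python
import functools
import warnings


def deprecated(message, category=DeprecationWarning):
    """Hypothetical decorator that warns on every call to the wrapped function."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # stacklevel=2 points the warning at the caller, not at wrapper().
            warnings.warn(message, category, stacklevel=2)
            return func(*args, **kwargs)
        return wrapper
    return decorator


# Normal deprecation period: quiet for most end users.
@deprecated("old_sum is deprecated, use sum instead")
def old_sum(values):
    return sum(values)


# Grace period after someone asked for an extension: louder category.
@deprecated("old_mean will be removed soon, use statistics.mean instead",
            category=FutureWarning)
def old_mean(values):
    return sum(values) / len(values)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    total = old_sum([1, 2, 3])
    mean = old_mean([1, 2, 3])
```

Both functions keep working throughout; only the visibility of the warning changes between the two stages.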
Other points that were discussed
================================
- There was inconclusive discussion of what we should do with dot() in the places where it disagrees with the PEP 465 matmul semantics (specifically this is when both arguments have ndim >= 3, or one argument has ndim == 0).
  - The concern is that the current behavior is not very useful, and as far as we can tell no-one is using it; but, as people get used to the more-useful PEP 465 behavior, they will increasingly try to use it on the assumption that np.dot will work the same way, and this will create pain for lots of people. So Nathaniel argued that we should start at least issuing a visible warning when people invoke the corner-case behavior.
  - But OTOH, np.dot is such a core piece of infrastructure, and there's such a large landscape of code out there using numpy that we can't see, that others were reasonably wary of making any change.
  - For now: document prominently, but no change in behavior.
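For anyone who hasn't run into the dot() corner cases, this short example shows both disagreements, assuming a NumPy recent enough to have np.matmul. The array shapes are arbitrary and chosen only to make the difference visible.

```python
import numpy as np

a = np.ones((2, 3, 4))
b = np.ones((2, 4, 5))

# dot contracts the last axis of a with the second-to-last axis of b and
# keeps every other axis of both, giving shape (2, 3, 2, 5).
dot_shape = np.dot(a, b).shape

# matmul treats the leading axis as a stack of matrices and broadcasts
# over it, giving shape (2, 3, 5).
matmul_shape = np.matmul(a, b).shape

# With a scalar argument, dot just multiplies...
scaled = np.dot(2.0, np.ones(3))

# ...while matmul refuses scalars outright.
try:
    np.matmul(2.0, np.ones(3))
    matmul_accepts_scalar = True
except (ValueError, TypeError):
    matmul_accepts_scalar = False
```

Someone used to the PEP 465 behavior who writes np.dot(a, b) on stacked matrices gets a 4-d result with no warning, which is the pain point described above.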
Links to raw notes
==================
Main page: [https://github.com/numpy/numpy/wiki/SciPy-2015-developer-meeting]
Notes from the meeting proper: [ https://docs.google.com/document/d/1IJcYdsHtk8MVAM4AZqFDBSf_nVG-mrB4Tv2bh9u1... ]
Slides from the followup BoF: [ https://gist.github.com/njsmith/eb42762054c88e810786/raw/b74f978ce10a972831c... ]
Notes from the followup BoF: [ https://docs.google.com/document/d/11AuTPms5dIPo04JaBOWEoebXfk-tUzEZ-CvFnLIt... ]
-n
Hi Nathaniel. Thanks for putting this together. Chuck