New subject: Notes from the numpy dev meeting at scipy 2015

Aug. 25, 2015

Hi all,

These are the notes from the NumPy dev meeting held July 7, 2015, at
the SciPy conference in Austin, presented here so the list can keep up
with what happens, and so you can give feedback. Please do give
feedback; none of this is final!

(Also, if anyone who was there notices anything I left out or
mischaracterized, please speak up -- these are a lot of notes I'm
trying to gather together, so I could easily have missed something!)

Thanks to Jill Cowan and the rest of the SciPy organizers for donating
space and organizing logistics for us, and to the Berkeley Institute
for Data Science for funding travel for Jaime, Nathaniel, and
Sebastian.

Attendees
=========

  Present in the room for all or part: Daniel Allan, Chris Barker,
  Sebastian Berg, Thomas Caswell, Jeff Reback, Jaime Fernández del
  Río, Chuck Harris, Nathaniel Smith, Stéfan van der Walt. (Note: I'm
  pretty sure this list is incomplete)

  Joining remotely for all or part: Stephan Hoyer, Julian Taylor.

Formalizing our governance/decision making
==========================================

  This was a major focus of discussion. At a high level, the consensus
  was to steal IPython's governance document ("IPEP 29") and modify it
  to remove its use of a BDFL as a "backstop" to normal community
  consensus-based decision-making, and replace it with a new "backstop"
  based on Apache-project-style consensus voting amongst the core team.

  I'll send out a proper draft of this shortly for further discussion.

Development roadmap
===================

  General consensus:

  Let's assume NumPy is going to remain important indefinitely, and
  try to make it better, instead of waiting for something better to
  come along. (This is unlikely to be wasted effort even if something
  better does come along, and it's hardly a sure thing that that will
  happen anyway.)

  Let's focus on evolving numpy as far as we can without major
  break-the-world changes (no "numpy 2.0", at least in the foreseeable
  future).

  And, as a target for that evolution, let's change our focus from
  "NumPy is the library that gives you the np.ndarray object (plus
  some attached infrastructure)" to "NumPy provides the standard
  framework for working with arrays and array-like objects in
  Python".

  This means creating defined interfaces between array-like objects /
  ufunc objects / dtype objects, so that it becomes possible for third
  parties to add their own and mix-and-match. Right now ufuncs are
  pretty good at this, but if you want a new array class or dtype then
  in most cases you pretty much have to modify numpy itself.

  Vision: instead of everyone who wants a new container type having to
  reimplement all of numpy, Alice can implement an array class using
  (sparse / distributed / compressed / tiled / gpu / out-of-core /
  delayed / ...) storage, pass it to code that was written using
  direct calls to np.* functions, and it just works. (Instead of
  np.sin being "the way you calculate the sine of an ndarray", it's
  "the way you calculate the sine of any array-like container
  object".)

  Vision: Darryl can implement a new dtype for (categorical data /
  astronomical dates / integers-with-missing-values / ...) without
  having to touch the numpy core.

  Vision: Chandni can then come along and combine them by doing

  a = alice_array([...], dtype=darryl_dtype)

  and it just works.

  Vision: no-one is tempted to subclass ndarray, because anything you
  can do with an ndarray subclass you can also easily do by defining
  your own new class that implements the "array protocol".
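
  A minimal sketch of this "array protocol" vision, under stated
  assumptions: the hook discussed in these notes as __numpy_ufunc__
  later shipped in NumPy 1.13 under the name __array_ufunc__, and the
  sketch relies on that released name. AliceArray and its list-based
  storage are hypothetical stand-ins, not anything proposed at the
  meeting.

```python
import numpy as np


class AliceArray:
    """Hypothetical array-like container; deliberately NOT an ndarray subclass."""

    def __init__(self, values):
        # Stand-in for exotic storage (sparse, distributed, ...); here just a list.
        self._values = list(values)

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        # Only handle plain calls such as np.sin(x); decline everything else.
        if method != "__call__" or kwargs:
            return NotImplemented
        # Unwrap AliceArray inputs into ndarrays, apply the ufunc, re-wrap.
        unwrapped = [
            np.asarray(x._values) if isinstance(x, AliceArray) else x
            for x in inputs
        ]
        return AliceArray(ufunc(*unwrapped))


a = AliceArray([0.0, np.pi / 2])
result = np.sin(a)            # dispatches to AliceArray.__array_ufunc__
print(type(result).__name__)  # AliceArray
```

  Code written against np.sin does not need to know AliceArray exists,
  which is the point of the vision; a real container would of course
  avoid converting its storage to an ndarray on every call.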

Supporting third-party array types
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

  Sub-goals:
  - Get __numpy_ufunc__ done, which will cover a good chunk of numpy's
    API right there.
  - Go through the rest of the stuff in numpy, and figure out some
    story for how to let it handle third-party array classes:
    - ufunc ALL the things: Some things can be converted directly into
      (g)ufuncs and then use __numpy_ufunc__ (e.g., np.std); some
      things could be converted into (g)ufuncs if we extended the
      (g)ufunc interface a bit (e.g. np.sort, np.matmul).
    - Some things probably need their own __numpy_ufunc__-like
      extensions (__numpy_concatenate__?)
  - Provide tools to make it easier to implement the more complicated
    parts of an array object (e.g. the bazillion different methods,
    many of which are ufuncs in disguise, or indexing)
  - Longer-run interesting research project: __numpy_ufunc__ requires
    that one or the other object have explicit knowledge of how to
    handle the other, so to handle binary ufuncs with N array types
    you need something like N**2 __numpy_ufunc__ code paths. As an
    alternative, if there were some interface that an object could
    export that provided the operations nditer needs to efficiently
    iterate over (chunks of) it, then you would only need N
    implementations of this interface to handle all N**2 operations.
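
  The N-versus-N**2 idea in the last sub-goal can be sketched as
  follows. Everything here is hypothetical: the iter_chunks method,
  the two container classes, and the apply_binary driver are invented
  for illustration, and the sketch assumes both operands have the same
  length. Each container implements one chunk-yielding method, and a
  single generic driver handles every pairing.

```python
import numpy as np


class ListBacked:
    """Hypothetical container storing values in a Python list."""

    def __init__(self, values):
        self.values = list(values)

    def iter_chunks(self, size):
        # Yield successive ndarray chunks of at most `size` elements.
        for start in range(0, len(self.values), size):
            yield np.asarray(self.values[start:start + size], dtype=float)


class RangeBacked:
    """Hypothetical lazy container representing 0, 1, ..., n-1."""

    def __init__(self, n):
        self.n = n

    def iter_chunks(self, size):
        # Chunks are generated on demand; nothing is stored up front.
        for start in range(0, self.n, size):
            yield np.arange(start, min(start + size, self.n), dtype=float)


def apply_binary(ufunc, left, right, chunk_size=2):
    # Generic driver: knows nothing about either container beyond iter_chunks.
    pieces = [
        ufunc(a, b)
        for a, b in zip(left.iter_chunks(chunk_size),
                        right.iter_chunks(chunk_size))
    ]
    return np.concatenate(pieces)


total = apply_binary(np.add, ListBacked([10, 20, 30]), RangeBacked(3))
print(total)  # [10. 21. 32.]
```

  Adding a third container type means writing one more iter_chunks
  method, not two more pairwise code paths.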

  This would solve a lot of problems for projects like:
  - blosc
  - dask
  - distarray
  - numpy.ma
  - pandas
  - scipy.sparse
  - xray

Supporting third-party dtypes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

  We already have something like a C level "dtype
  protocol". Conceptually, the way you define a new dtype is by
  defining a new class whose instances have data attributes defining
  the parameters of the dtype (what fields are in *this* record dtype,
  how many characters are in *this* string dtype, what units are used
  for *this* datetime64, etc.), and you define a bunch of methods to
  do things like convert an object from a Python object to your dtype
  or vice-versa, to copy an array of your dtype from one place to
  another, to cast to and from your new dtype, etc. This part is
  great.

  The problem is, in the current implementation, we don't actually use
  the Python object system to define these classes / attributes /
  methods. Instead, all possible dtypes are jammed into a single
  Python-level class, whose struct has fields for the union of all
  possible dtypes' attributes, and instead of Python-style method
  slots there's just a big table of function pointers attached to each
  object.

  So the main proposal is that we keep the basic design, but switch it
  so that the float64 dtype, the int64 dtype, etc. actually literally
  are subclasses of np.dtype, each implementing their own fields and
  Python-style methods.
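
  As a rough pure-Python illustration of the shape of this proposal
  (the real work would be at the C level, and every class and method
  name below is hypothetical): each dtype becomes its own class,
  carrying only the attributes and methods that make sense for it.

```python
class DType:
    """Hypothetical base class; only what every dtype shares lives here."""

    itemsize = None

    def from_python(self, obj):
        raise NotImplementedError

    def to_python(self, stored):
        raise NotImplementedError


class Float64DType(DType):
    itemsize = 8

    def from_python(self, obj):
        return float(obj)

    def to_python(self, stored):
        return stored


class CategoricalDType(DType):
    itemsize = 8

    def __init__(self, categories):
        # Parameter specific to *this* dtype instance; no other dtype has it.
        self.categories = tuple(categories)

    def from_python(self, obj):
        # Store the integer code of the category.
        return self.categories.index(obj)

    def to_python(self, stored):
        return self.categories[stored]


colors = CategoricalDType(["red", "green", "blue"])
code = colors.from_python("green")
print(code, colors.to_python(code))  # 1 green
```

  Note that Float64DType has no "categories" field at all, in contrast
  to the current single struct holding the union of every dtype's
  fields.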

  Some of the pieces involved in doing this:

  - The current dtype methods should be cleaned up -- e.g. 'dot' and
    'less_than' are both dtype methods, when conceptually they're much
    more like ufuncs.

  - Ufunc inner loops currently do not get a reference to the dtype
    object, so they can't see its attributes, and this is a big
    obstacle to many interesting dtypes (e.g., it's hard to
    implement np.equal for categoricals if you don't know what
    categories each has). So we need to add new arguments to the core
    ufunc loop signature. (Fortunately this can be done in a
    backwards-compatible way.)

  - We need to figure out what exactly the dtype methods should be,
    and add them to the dtype class (possibly with backwards
    compatibility shims for anyone who is accessing PyArray_ArrFuncs
    directly).

  - Casting will be possibly the trickiest thing to work out, though
    the basic idea of using dunder-dispatch-like __cast__ and
    __rcast__ methods seems workable. (Encouragingly, this is
    exactly what dynd does too, though unfortunately dynd does not
    yet support user-defined dtypes even to the extent that numpy
    does, so there isn't much else we can steal from them.)
    - We may also want to rethink the casting rules while we're at it,
      since they have some very weird corners right now (e.g. see
      [https://github.com/numpy/numpy/issues/6240])

  - We need to migrate the current dtypes over to the new system,
    which can be done in stages:

    - First stick them all in a single "legacy dtype" class whose
      methods just dispatch to the PyArray_ArrFuncs per-object "method
      table"

    - Then move each of them into their own classes

  - We should provide a Python-level wrapper for the protocol, so that
    you can call dtype methods from Python

  - And vice-versa, it should be possible to subclass dtype at the
    Python level

  - etc.
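
  The __cast__ / __rcast__ idea from the casting item above could
  look roughly like this in a pure-Python toy. The method signatures,
  the cast() driver, and both dtype classes are assumptions made for
  illustration; nothing about the eventual design was settled at the
  meeting.

```python
class FloatDType:
    """Hypothetical built-in-like dtype that knows no third-party types."""

    def __cast__(self, value, target):
        return NotImplemented

    def __rcast__(self, value, source):
        return NotImplemented


class MillisDType:
    """Hypothetical third-party dtype: integer milliseconds."""

    def __cast__(self, value, target):
        # Millis -> float seconds.
        if isinstance(target, FloatDType):
            return value / 1000.0
        return NotImplemented

    def __rcast__(self, value, source):
        # float seconds -> Millis.
        if isinstance(source, FloatDType):
            return int(round(value * 1000))
        return NotImplemented


def cast(value, source, target):
    # Ask the source dtype first, then fall back to the target dtype,
    # mirroring how Python tries __add__ and then __radd__.
    result = source.__cast__(value, target)
    if result is NotImplemented:
        result = target.__rcast__(value, source)
    if result is NotImplemented:
        raise TypeError(
            f"cannot cast {type(source).__name__} to {type(target).__name__}"
        )
    return result


print(cast(1500, MillisDType(), FloatDType()))  # 1.5
print(cast(2.5, FloatDType(), MillisDType()))   # 2500
```

  The second call succeeds even though FloatDType has never heard of
  MillisDType, because the target dtype gets a chance via __rcast__.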

  Fortunately, AFAICT pretty much all of this can be done while
  maintaining backwards compatibility (though we may want to break
  some obscure cases to avoid expending *too* much effort with weird
  backcompat contortions that will only help a vanishingly small
  proportion of the userbase), and a lot of the above changes can be
  done as semi-independent mini-projects, so there's no need for some
  branch to go off and spend a year rewriting the world.

  Obviously there are still a lot of details to work out, though. But
  overall, there was widespread agreement that this is one of the #1
  pain points for our users (e.g. it's the single main request from
  pandas), and fixing it is very high priority.

  Some features that would become straightforward to implement
  (e.g. even in third-party libraries) if this were fixed:
  - missing value support
  - physical unit tracking (meters / seconds -> array of velocity;
    meters + seconds -> error)
  - better and more diverse datetime representations (e.g. datetimes
    with attached timezones, or using funky geophysical or
    astronomical calendars)
  - categorical data
  - variable length strings
  - strings-with-encodings (e.g. latin1)
  - forward mode automatic differentiation (write a function that
    computes f(x) where x is an array of float64; pass that function
    an array with a special dtype and get out both f(x) and f'(x))
  - probably others I'm forgetting right now
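
  To make the forward-mode automatic differentiation item concrete,
  here is a small scalar sketch using dual numbers. The Dual class is
  hypothetical and works on plain Python scalars; the feature
  described above would instead expose this arithmetic as a dtype so
  that whole arrays could carry (value, derivative) pairs.

```python
class Dual:
    """Value paired with its derivative (a dual number)."""

    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    @staticmethod
    def _lift(other):
        # Treat plain numbers as constants (derivative zero).
        return other if isinstance(other, Dual) else Dual(other)

    def __add__(self, other):
        other = self._lift(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __mul__(self, other):
        other = self._lift(other)
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

    __rmul__ = __mul__


def f(x):
    # Written with no knowledge of Dual: f(x) = 3*x*x + 2*x + 1
    return 3 * x * x + 2 * x + 1


print(f(2.0))                # 17.0
out = f(Dual(2.0, 1.0))      # seed with dx/dx = 1
print(out.value, out.deriv)  # 17.0 14.0
```

  The same function f yields both f(2) = 17 and f'(2) = 6*2 + 2 = 14
  once it is handed the special input type.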

  I should also note that there was one substantial objection to this
  plan, from Travis Oliphant (in discussions later in the
  conference). I'm not confident I understand his objections well
  enough to reproduce them here, though -- perhaps he'll elaborate.

Money
=====

  There was an extensive discussion on the topic of: "if we had money,
  what would we do with it?"

  This is partially motivated by the realization that there are a
  number of sources that we could probably get money from, if we had a
  good story for what we wanted to do, so it's not just an idle
  question.

  Points of general agreement:

  - Doing the in-person meeting was a good thing. We should plan to do
    that again, at least once a year. So one thing to spend money on
    is travel subsidies to make sure that happens and is productive.

  - While it's tempting to imagine hiring junior people for the more
    frustrating/boring work like maintaining buildbots, release
    infrastructure, updating docs, etc., this seems difficult to do
    realistically with our current resources -- how do we hire for
    this, who would manage them, etc.?

  - On the other hand, the general feeling was that if we found the
    money to hire a few more senior people who could take care of
    themselves more, then that would be good and we could
    realistically absorb that extra work without totally unbalancing
    the project.

    - A major open question is how we would recruit someone for a
      position like this, since apparently all the obvious candidates
      who are already active on the NumPy team already have other
      things going on. [For calibration on how hard this can be: NYU
      has apparently had an open position for a year with the job
      description of "come work at NYU full-time with a
      private-industry-competitive-salary on whatever your personal
      open-source scientific project is" (!) and still is having an
      extremely difficult time filling it:
      [http://cds.nyu.edu/research-engineer/]]

    - General consensus was that there isn't much to be done about
      this, though, except try it and see.

    - (By the way, if you're someone who's reading this and
      potentially interested in like a postdoc or better working on
      numpy, then let's talk...)

More specific changes to numpy that had general consensus, but don't really fit into a high-level roadmap
=========================================================================================================

  - Resolved: we should merge multiarray.so and umath.so into a single
    extension module, so that they can share utility code without the
    current awkward contortions.

  - Resolved: we should start hiding new fields in the ufunc and dtype
    structs as soon as possible going forward. (I.e. they would not be
    present in the version of the structs that are exposed through the
    C API, but internally we would use a more detailed struct.)
    - Mayyyyyybe we should even go ahead and hide the subset of the
      existing fields that are really internal details that no-one
      should be using. If we did this without changing anything else
      then it would preserve ABI (the fields would still be where
      existing compiled extensions expect them to be, if any such
      extensions exist) while breaking API (trying to compile such
      extensions would give a clear error), so would be a smoother
      ramp if we think we need to eventually break those fields for
      real. (As discussed above, there are a bunch of fields in the
      dtype base class that only make sense for specific dtype
      subclasses, e.g. only record dtypes need a list of field names,
      but right now all dtypes have one anyway. So it would be nice to
      remove these from the base class entirely, but that is
      potentially ABI-breaking.)

  - Resolved: np.array should never return an object array unless
    explicitly requested (e.g. with dtype=object); it just causes too
    many surprising problems.
    - First step: add a deprecation warning
    - Eventually: make it an error.

  - The matrix class
    - Resolved: We won't add warnings yet, but we will prominently
      document that it is deprecated and should be avoided wherever
      possible.
      - Stéfan van der Walt volunteers to do this.
    - We'd all like to deprecate it properly, but the feeling was that
      the precondition for this is for scipy.sparse to provide sparse
      "arrays" that don't return np.matrix objects on ordinary
      operations. Until that happens we can't reasonably tell people
      that using np.matrix is a bug.

  - Resolved: we should add a similar prominent note to the
    "subclassing ndarray" documentation, warning people that this is
    painful and barely works and please don't do it if you have any
    alternatives.

  - Resolved: we want more, smaller releases -- every 6 months at
    least, aiming to go even faster (every 4 months?)

  - On the question of using Cython inside numpy core:
    - Everyone agrees that there are places where this would be an
      improvement (e.g., Python<->C interfaces, and places "when you
      want to do computer science", e.g. complicated algorithmic stuff
      like graph traversals)
    - Chuck wanted it to be clear though that he doesn't think it
      would be a good goal to try and rewrite all of numpy in Cython
      -- there also exist places where Cython ends up being "an uglier
      version of C". No-one disagreed.

  - Our text reader is apparently not very functional on Python 3, and
    generally slow and hard to work with.
    - Resolved: We should extract Pandas's awesome text reader/parser
      and convert it into its own package, that could then become a
      new backend for both pandas and numpy.loadtxt.
    - Jeff thinks this is a great idea
    - Thomas Caswell volunteers to do the extraction.

  - We should work on improving our tools for evolving the ABI, so
    that we will eventually be less constrained by decisions made
    decades ago.
    - One idea that had a lot of support was to switch from our
      current append-only C-API to a "sliding window" API based on
      explicit versions. So a downstream package might say

      #define NUMPY_API_VERSION 4

      and they'd get the functions and behaviour provided in "version
      4" of the numpy C api. If they wanted to get access to new stuff
      that was added in version 5, then they'd need to switch that
      #define, and at the same time clean up any usage of stuff that
      was removed or changed in version 5. And to provide a smooth
      migration path, one version of numpy would support multiple
      versions at once, gradually deprecating and dropping old
      versions.

    - If anyone wants to help bring pip up to scratch WRT tracking ABI
      dependencies (e.g., 'pip install numpy==<version with new ABI>'
      -> triggers rebuild of scipy against the new ABI), then that
      would be an extremely useful thing.
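
  Returning to the np.array / object-array item in the list above, a
  short illustration of the distinction being drawn. The explicit
  dtype=object form is the spelling that would remain supported under
  that resolution; the implicit case shown is one that silently
  produces an object array. (How np.array treats ragged input
  *without* dtype=object has differed between NumPy releases, so the
  sketch avoids that case.)

```python
import numpy as np

ragged = [[1, 2], [3]]

# Explicit request: unambiguous, and allowed under the resolution above.
explicit = np.array(ragged, dtype=object)
print(explicit.dtype, explicit.shape)  # object (2,)

# Implicit fallback: no dtype was requested, yet the result is an object array.
implicit = np.array([1, None])
print(implicit.dtype)  # object
```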

Policies that should be documented
==================================

  ...together with some notes about what the contents of the document
  should be:

How we manage bugs in the bug tracker.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

  - Github "milestones" should *only* be assigned to release-blocker
    bugs (which mostly means "regression from the last release").

    In particular, if you're tempted to push a bug forward to the next
    release... then it's clearly not a blocker, so don't set it to the
    next release's milestone, just remove the milestone entirely.

    (Obvious exception to this: deprecation followup bugs where we
    decide that we want to keep the deprecation around a bit longer
    are a case where a bug actually does switch from being a blocker
    to release 1.x to being a blocker for release 1.(x+1).)

  - Don't hesitate to close an issue if there's no way forward --
    e.g. a PR where the author has disappeared. Just post a link to
    this policy and close, with a polite note that we need to keep our
    tracker useful as a todo list, but they're welcome to re-open if
    things change.

Deprecations and breakage policy:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

  - How long do we need to keep DeprecationWarnings around before we
    break things? This is tricky because on the one hand an aggressive
    (short) deprecation period lets us deliver new features and
    important cleanups more quickly, but on the other hand a
    too-aggressive deprecation period is difficult for our more
    conservative downstream users.

    - Idea that had the most support: pick a somewhat-aggressive
      warning period as our default, and make a rule that if someone
      asks for an extension during the beta cycle for the release that
      removes it, then we put it back for another release or two worth
      of grace period. (While also possibly upgrading the warning to
      be more visible during the grace period.) This gives us
      deprecation periods that are more adaptive on a case-by-case
      basis.

  - Lament: it would be really nice if we could get more people to
    test our beta releases, because in practice right now 1.x.0 ends
    up being where we actually discover all the bugs, and 1.x.1 is
    where it actually becomes usable. Which sucks, and makes it
    difficult to have a solid policy about what counts as a
    regression, etc. Is there anything we can do about this?

  - ABI breakage: we distinguish between an ABI break that breaks
    everything (e.g., "import scipy" segfaults), versus an ABI break
    that breaks an occasional rare case (e.g., only apps that poke
    around in some obscure corner of some struct are affected).

    - The "break-the-world" type remains off-limits for now: the pain
      is still too large (conda helps, but there are lots of people
      who don't use conda!), and there aren't really any compelling
      improvements that this would enable anyway.

    - For the "break-0.1%-of-users" type, it is *not* ruled out by
      fiat, though we remain conservative: we should treat it like
      other API breaks in principle, and do a careful case-by-case
      analysis of the details of the situation, taking into account
      what kind of code would be broken, how common these cases are,
      how important the benefits are, whether there are any specific
      mitigation strategies we can use, etc. -- with this process of
      course taking into account that a segfault is nastier than a
      Python exception.
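
  A sketch of what the "adaptive" deprecation idea above might look
  like in code, using only the standard warnings module. The function
  names and the grace_period flag are hypothetical; the one fact
  relied on is that Python shows FutureWarning by default while
  DeprecationWarning is hidden in many contexts, which makes
  FutureWarning one plausible way to "upgrade the warning to be more
  visible".

```python
import warnings


def new_api(x):
    return x * 2


def old_api(x, grace_period=False):
    # Hypothetical deprecated function. During a requested grace period
    # the warning category is upgraded so that end users see it too.
    category = FutureWarning if grace_period else DeprecationWarning
    warnings.warn(
        "old_api is deprecated; use new_api instead",
        category,
        stacklevel=2,
    )
    return new_api(x)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    old_api(1)
    old_api(1, grace_period=True)

print([w.category.__name__ for w in caught])
# ['DeprecationWarning', 'FutureWarning']
```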

Other points that were discussed
================================

  - There was inconclusive discussion of what we should do with dot()
    in the places where it disagrees with the PEP 465 matmul semantics
    (specifically this is when both arguments have ndim >= 3, or one
    argument has ndim == 0).
    - The concern is that the current behavior is not very useful, and
      as far as we can tell no-one is using it; but, as people get
      used to the more-useful PEP 465 behavior, they will increasingly
      try to use it on the assumption that np.dot will work the same
      way, and this will create pain for lots of people. So Nathaniel
      argued that we should start at least issuing a visible warning
      when people invoke the corner-case behavior.
    - But OTOH, np.dot is such a core piece of infrastructure, and
      there's such a large landscape of code out there using numpy
      that we can't see, that others were reasonably wary of making
      any change.
    - For now: document prominently, but no change in behavior.
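
  For readers who haven't run into the corner cases in question, a
  short demonstration of where np.dot and the PEP 465 semantics (as
  implemented by np.matmul) part ways. It assumes a NumPy recent
  enough to provide np.matmul.

```python
import numpy as np

a = np.ones((2, 3, 4))
b = np.ones((2, 4, 5))

# matmul treats the leading axes as a batch of matrices.
print(np.matmul(a, b).shape)  # (2, 3, 5)

# dot instead pairs the last axis of `a` with the second-to-last axis
# of `b` for every combination of the remaining axes.
print(np.dot(a, b).shape)     # (2, 3, 2, 5)

# With a 0-d argument, dot falls back to elementwise multiplication ...
print(np.dot(2.0, np.ones(3)))  # [2. 2. 2.]

# ... while matmul rejects scalars outright.
matmul_scalar_error = None
try:
    np.matmul(2.0, np.ones(3))
except ValueError as exc:
    matmul_scalar_error = exc
```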

Links to raw notes
==================

  Main page:
  [https://github.com/numpy/numpy/wiki/SciPy-2015-developer-meeting]

  Notes from the meeting proper:
  [https://docs.google.com/document/d/1IJcYdsHtk8MVAM4AZqFDBSf_nVG-mrB4Tv2bh9u1...]

  Slides from the followup BoF:
  [https://gist.github.com/njsmith/eb42762054c88e810786/raw/b74f978ce10a972831c...]

  Notes from the followup BoF:
  [https://docs.google.com/document/d/11AuTPms5dIPo04JaBOWEoebXfk-tUzEZ-CvFnLIt...]

-n

-- 
Nathaniel J. Smith -- http://vorpus.org
