These are the notes from the NumPy dev meeting held July 7, 2015, at
the SciPy conference in Austin, presented here so the list can keep up
with what happens, and so you can give feedback. Please do give
feedback, none of this is final!
(Also, if anyone who was there notices anything I left out or
mischaracterized, please speak up -- these are a lot of notes I'm
trying to gather together, so I could easily have missed something!)
Thanks to Jill Cowan and the rest of the SciPy organizers for donating
space and organizing logistics for us, and to the Berkeley Institute
for Data Science for funding travel for Jaime, Nathaniel, and
Present in the room for all or part: Daniel Allan, Chris Barker,
Sebastian Berg, Thomas Caswell, Jeff Reback, Jaime Fernández del
Río, Chuck Harris, Nathaniel Smith, Stéfan van der Walt. (Note: I'm
pretty sure this list is incomplete)
Joining remotely for all or part: Stephan Hoyer, Julian Taylor.
Formalizing our governance/decision making
This was a major focus of discussion. At a high level, the consensus
was to steal IPython's governance document ("IPEP 29") and modify it
to remove its use of a BDFL as a "backstop" to normal community
consensus-based decision, and replace it with a new "backstop" based
on Apache-project-style consensus voting amongst the core team.
I'll send out a proper draft of this shortly for further discussion.
Let's assume NumPy is going to remain important indefinitely, and
try to make it better, instead of waiting for something better to
come along. (This is unlikely to be wasted effort even if something
better does come along, and it's hardly a sure thing that that will
Let's focus on evolving numpy as far as we can without major
break-the-world changes (no "numpy 2.0", at least in the foreseeable
And, as a target for that evolution, let's change our focus from
"NumPy is the library that gives you the np.ndarray object
(plus some attached infrastructure)", to "NumPy provides the
standard framework for working with arrays and array-like objects in
This means, creating defined interfaces between array-like objects /
ufunc objects / dtype objects, so that it becomes possible for third
parties to add their own and mix-and-match. Right now ufuncs are
pretty good at this, but if you want a new array class or dtype then
in most cases you pretty much have to modify numpy itself.
Vision: instead of everyone who wants a new container type having to
reimplement all of numpy, Alice can implement an array class using
(sparse / distributed / compressed / tiled / gpu / out-of-core /
delayed / ...) storage, pass it to code that was written using
direct calls to np.* functions, and it just works. (Instead of
np.sin being "the way you calculate the sine of an ndarray", it's
"the way you calculate the sine of any array-like container
Vision: Darryl can implement a new dtype for (categorical data /
astronomical dates / integers-with-missing-values / ...) without
having to touch the numpy core.
Vision: Chandni can then come along and combine them by doing
a = alice_array([...], dtype=darryl_dtype)
and it just works.
Vision: no-one is tempted to subclass ndarray, because anything you
can do with an ndarray subclass you can also easily do by defining
your own new class that implements the "array protocol".
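A minimal sketch of what such an "array protocol" class could look like. It assumes the hook that later shipped in NumPy (from 1.13, as far as I know) under the name __array_ufunc__, which grew out of the __numpy_ufunc__ idea in these notes. LoggedArray is a hypothetical name, and the sketch only handles plain ufunc calls without out=.

```python
import numpy as np


class LoggedArray:
    """Hypothetical array-like container; deliberately NOT an ndarray subclass."""

    def __init__(self, data):
        self.data = np.asarray(data)

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        # Only handle ordinary calls such as np.sin(x); decline reductions,
        # accumulations and explicit out= arguments in this sketch.
        if method != "__call__" or kwargs.get("out") is not None:
            return NotImplemented
        # Unwrap our own instances and pass everything else through untouched.
        raw = [x.data if isinstance(x, LoggedArray) else x for x in inputs]
        result = ufunc(*raw, **kwargs)
        return LoggedArray(result)


wrapped = LoggedArray([0.0, np.pi / 2])
out = np.sin(wrapped)      # dispatches to LoggedArray.__array_ufunc__
total = np.add(wrapped, 1) # binary ufuncs take the same path
```

Code written against np.sin or np.add then works on the wrapper without knowing about it, which is the point of the vision above. Multi-output ufuncs and the non-ufunc parts of the API would need more work than is shown here.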
Supporting third-party array types
- Get __numpy_ufunc__ done, which will cover a good chunk of numpy's
API right there.
- Go through the rest of the stuff in numpy, and figure out some
story for how to let it handle third-party array classes:
- ufunc ALL the things: Some things can be converted directly into
(g)ufuncs and then use __numpy_ufunc__ (e.g., np.std); some
things could be converted into (g)ufuncs if we extended the
(g)ufunc interface a bit (e.g. np.sort, np.matmul).
- Some things probably need their own __numpy_ufunc__-like
- Provide tools to make it easier to implement the more complicated
parts of an array object (e.g. the bazillion different methods,
many of which are ufuncs in disguise, or indexing)
- Longer-run interesting research project: __numpy_ufunc__ requires
that one or the other object have explicit knowledge of how to
handle the other, so to handle binary ufuncs with N array types
you need something like N**2 __numpy_ufunc__ code paths. As an
alternative, if there were some interface that an object could
export that provided the operations nditer needs to efficiently
iterate over (chunks of) it, then you would only need N
implementations of this interface to handle all N**2 operations.
This would solve a lot of problems for projects like:
Supporting third-party dtypes
We already have something like a C level "dtype
protocol". Conceptually, the way you define a new dtype is by
defining a new class whose instances have data attributes defining
the parameters of the dtype (what fields are in *this* record dtype,
how many characters are in *this* string dtype, what units are used
for *this* datetime64, etc.), and you define a bunch of methods to
do things like convert an object from a Python object to your dtype
or vice-versa, to copy an array of your dtype from one place to
another, to cast to and from your new dtype, etc. This part is
The problem is, in the current implementation, we don't actually use
the Python object system to define these classes / attributes /
methods. Instead, all possible dtypes are jammed into a single
Python-level class, whose struct has fields for the union of all
possible dtypes' attributes, and instead of Python-style method
slots there's just a big table of function pointers attached to each
So the main proposal is that we keep the basic design, but switch it
so that the float64 dtype, the int64 dtype, etc. actually literally
are subclasses of np.dtype, each implementing their own fields and
Some of the pieces involved in doing this:
- The current dtype methods should be cleaned up -- e.g. 'dot' and
'less_than' are both dtype methods, when conceptually they're much
more like ufuncs.
- The ufunc inner-loop interface currently does not get a reference
to the dtype object, so they can't see its attributes and this is
a big obstacle to many interesting dtypes (e.g., it's hard to
implement np.equal for categoricals if you don't know what
categories each has). So we need to add new arguments to the core
ufunc loop signature. (Fortunately this can be done in a
- We need to figure out what exactly the dtype methods should be,
and add them to the dtype class (possibly with backwards
compatibility shims for anyone who is accessing PyArray_ArrFuncs
- Casting will be possibly the trickiest thing to work out, though
the basic idea of using dunder-dispatch-like __cast__ and
__rcast__ methods seems workable. (Encouragingly, this is
exactly what dynd also does, though unfortunately dynd does not
yet support user-defined dtypes even to the extent that numpy
does, so there isn't much else we can steal from them.)
- We may also want to rethink the casting rules while we're at it,
since they have some very weird corners right now (e.g. see
- We need to migrate the current dtypes over to the new system,
which can be done in stages:
- First stick them all in a single "legacy dtype" class whose
methods just dispatch to the PyArray_ArrFuncs per-object "method
- Then move each of them into their own classes
- We should provide a Python-level wrapper for the protocol, so that
you can call dtype methods from Python
- And vice-versa, it should be possible to subclass dtype at the
Fortunately, AFAICT pretty much all of this can be done while
maintaining backwards compatibility (though we may want to break
some obscure cases to avoid expending *too* much effort with weird
backcompat contortions that will only help a vanishingly small
proportion of the userbase), and a lot of the above changes can be
done as semi-independent mini-projects, so there's no need for some
branch to go off and spend a year rewriting the world.
Obviously there are still a lot of details to work out, though. But
overall, there was widespread agreement that this is one of the #1
pain points for our users (e.g. it's the single main request from
pandas), and fixing it is very high priority.
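To make the categorical example above concrete, here is a pure-Python sketch of why an inner loop needs access to the dtype instance. Everything in it (CategoricalDType, encode, decode, equal_loop) is hypothetical and only illustrates the idea; it is not a proposed NumPy API.

```python
class CategoricalDType:
    """Hypothetical parametrized dtype: values are stored as integer codes."""

    def __init__(self, categories):
        # The per-instance parameter: which categories *this* dtype has.
        self.categories = tuple(categories)

    def encode(self, value):
        return self.categories.index(value)

    def decode(self, code):
        return self.categories[code]


def equal_loop(codes_a, dtype_a, codes_b, dtype_b):
    """Element-wise equality that consults both dtype instances."""
    return [dtype_a.decode(a) == dtype_b.decode(b)
            for a, b in zip(codes_a, codes_b)]


fruit = CategoricalDType(["apple", "pear"])
other = CategoricalDType(["pear", "apple"])  # same labels, different codes

a = [fruit.encode(v) for v in ["apple", "pear"]]
b = [other.encode(v) for v in ["apple", "pear"]]

# Comparing raw storage gives the wrong answer...
raw_equal = [x == y for x, y in zip(a, b)]
# ...while a loop that can see the dtypes gets it right.
aware_equal = equal_loop(a, fruit, b, other)
```

A loop that only sees the stored codes has no way to tell that code 0 means "apple" in one array and "pear" in the other.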
Some features that would become straightforward to implement
(e.g. even in third-party libraries) if this were fixed:
- missing value support
- physical unit tracking (meters / seconds -> array of velocity;
meters + seconds -> error)
- better and more diverse datetime representations (e.g. datetimes
with attached timezones, or using funky geophysical or
- categorical data
- variable length strings
- strings-with-encodings (e.g. latin1)
- forward mode automatic differentiation (write a function that
computes f(x) where x is an array of float64; pass that function
an array with a special dtype and get out both f(x) and f'(x))
- probably others I'm forgetting right now
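As an illustration of the forward-mode automatic differentiation item, here is a small pure-Python sketch using dual numbers. The Dual class is hypothetical; with a real user-defined dtype the same arithmetic would presumably live in the ufunc inner loops rather than in Python operators.

```python
class Dual:
    """Hypothetical dual number: carries a value and its derivative."""

    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    @staticmethod
    def _lift(other):
        # Plain numbers are constants, so their derivative is zero.
        return other if isinstance(other, Dual) else Dual(other)

    def __add__(self, other):
        other = Dual._lift(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __mul__(self, other):
        other = Dual._lift(other)
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

    __rmul__ = __mul__


def f(x):
    # Written with ordinary arithmetic; knows nothing about Dual.
    return 3 * x * x + 2 * x + 1


xs = [Dual(v, 1.0) for v in (0.0, 1.0, 2.0)]  # seed dx/dx = 1
ys = [f(x) for x in xs]
values = [y.value for y in ys]  # f(x)
derivs = [y.deriv for y in ys]  # f'(x) = 6x + 2
```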
I should also note that there was one substantial objection to this
plan, from Travis Oliphant (in discussions later in the
conference). I'm not confident I understand his objections well
enough to reproduce them here, though -- perhaps he'll elaborate.
There was an extensive discussion on the topic of: "if we had money,
what would we do with it?"
This is partially motivated by the realization that there are a
number of sources that we could probably get money from, if we had a
good story for what we wanted to do, so it's not just an idle
Points of general agreement:
- Doing the in-person meeting was a good thing. We should plan to do
that again, at least once a year. So one thing to spend money on
is travel subsidies to make sure that happens and is productive.
- While it's tempting to imagine hiring junior people for the more
frustrating/boring work like maintaining buildbots, release
infrastructure, updating docs, etc., this seems difficult to do
realistically with our current resources -- how do we hire for
this, who would manage them, etc.?
- On the other hand, the general feeling was that if we found the
money to hire a few more senior people who could take care of
themselves more, then that would be good and we could
realistically absorb that extra work without totally unbalancing
- A major open question is how we would recruit someone for a
position like this, since apparently all the obvious candidates
who are already active on the NumPy team already have other
things going on. [For calibration on how hard this can be: NYU
has apparently had an open position for a year with the job
description of "come work at NYU full-time with a
private-industry-competitive-salary on whatever your personal
open-source scientific project is" (!) and still is having an
extremely difficult time filling it:
- General consensus was that there isn't much to be done
about this though, except try it and see.
- (By the way, if you're someone who's reading this and
potentially interested in like a postdoc or better working on
numpy, then let's talk...)
More specific changes to numpy that had general consensus, but don't
really fit into a high-level roadmap
- Resolved: we should merge multiarray.so and umath.so into a single
extension module, so that they can share utility code without the
current awkward contortions.
- Resolved: we should start hiding new fields in the ufunc and dtype
structs as soon as possible going forward. (I.e. they would not be
present in the version of the structs that are exposed through the
C API, but internally we would use a more detailed struct.)
- Mayyyyyybe we should even go ahead and hide the subset of the
existing fields that are really internal details that no-one
should be using. If we did this without changing anything else
then it would preserve ABI (the fields would still be where
existing compiled extensions expect them to be, if any such
extensions exist) while breaking API (trying to compile such
extensions would give a clear error), so would be a smoother
ramp if we think we need to eventually break those fields for
real. (As discussed above, there are a bunch of fields in the
dtype base class that only make sense for specific dtype
subclasses, e.g. only record dtypes need a list of field names,
but right now all dtypes have one anyway. So it would be nice to
remove these from the base class entirely, but that is
- Resolved: np.array should never return an object array unless
explicitly requested (e.g. with dtype=object); it just causes too
many surprising problems.
- First step: add a deprecation warning
- Eventually: make it an error.
- The matrix class
- Resolved: We won't add warnings yet, but we will prominently
document that it is deprecated and should be avoided wherever
- Stéfan van der Walt volunteers to do this.
- We'd all like to deprecate it properly, but the feeling was that
the precondition for this is for scipy.sparse to provide sparse
"arrays" that don't return np.matrix objects on ordinary
operations. Until that happens we can't reasonably tell people
that using np.matrix is a bug.
- Resolved: we should add a similar prominent note to the
"subclassing ndarray" documentation, warning people that this is
painful and barely works and please don't do it if you have any
- Resolved: we want more, smaller releases -- every 6 months at
least, aiming to go even faster (every 4 months?)
- On the question of using Cython inside numpy core:
- Everyone agrees that there are places where this would be an
improvement (e.g., Python<->C interfaces, and places "when you
want to do computer science", e.g. complicated algorithmic stuff
like graph traversals)
- Chuck wanted it to be clear though that he doesn't think it
would be a good goal to try and rewrite all of numpy in Cython
-- there also exist places where Cython ends up being "an uglier
version of C". No-one disagreed.
- Our text reader is apparently not very functional on Python 3, and
generally slow and hard to work with.
- Resolved: We should extract Pandas's awesome text reader/parser
and convert it into its own package, that could then become a
new backend for both pandas and numpy.loadtxt.
- Jeff thinks this is a great idea
- Thomas Caswell volunteers to do the extraction.
- We should work on improving our tools for evolving the ABI, so
that we will eventually be less constrained by decisions made
- One idea that had a lot of support was to switch from our
current append-only C-API to a "sliding window" API based on
explicit versions. So a downstream package might say
#define NUMPY_API_VERSION 4
and they'd get the functions and behaviour provided in "version
4" of the numpy C api. If they wanted to get access to new stuff
that was added in version 5, then they'd need to switch that
#define, and at the same time clean up any usage of stuff that
was removed or changed in version 5. And to provide a smooth
migration path, one version of numpy would support multiple
versions at once, gradually deprecating and dropping old
- If anyone wants to help bring pip up to scratch WRT tracking ABI
dependencies (e.g., 'pip install numpy==<version with new ABI>'
-> triggers rebuild of scipy against the new ABI), then that
would be an extremely useful thing.
Policies that should be documented
...together with some notes about what the contents of the document
How we manage bugs in the bug tracker.
- Github "milestones" should *only* be assigned to release-blocker
bugs (which mostly means "regression from the last release").
In particular, if you're tempted to push a bug forward to the next
release... then it's clearly not a blocker, so don't set it to the
next release's milestone, just remove the milestone entirely.
(Obvious exception to this: deprecation followup bugs where we
decide that we want to keep the deprecation around a bit longer
are a case where a bug actually does switch from being a blocker
to release 1.x to being a blocker for release 1.(x+1).)
- Don't hesitate to close an issue if there's no way forward --
e.g. a PR where the author has disappeared. Just post a link to
this policy and close, with a polite note that we need to keep our
tracker useful as a todo list, but they're welcome to re-open if
Deprecations and breakage policy:
- How long do we need to keep DeprecationWarnings around before we
break things? This is tricky because on the one hand an aggressive
(short) deprecation period lets us deliver new features and
important cleanups more quickly, but on the other hand a
too-aggressive deprecation period is difficult for our more
conservative downstream users.
- Idea that had the most support: pick a somewhat-aggressive
warning period as our default, and make a rule that if someone
asks for an extension during the beta cycle for the release that
removes it, then we put it back for another release or two worth
of grace period. (While also possibly upgrading the warning to
be more visible during the grace period.) This gives us
deprecation periods that are more adaptive on a case-by-case
- Lament: it would be really nice if we could get more people to
test our beta releases, because in practice right now 1.x.0 ends
up being where we actually discover all the bugs, and 1.x.1 is
where it actually becomes usable. Which sucks, and makes it
difficult to have a solid policy about what counts as a
regression, etc. Is there anything we can do about this?
- ABI breakage: we distinguish between an ABI break that breaks
everything (e.g., "import scipy" segfaults), versus an ABI break
that breaks an occasional rare case (e.g., only apps that poke
around in some obscure corner of some struct are affected).
- The "break-the-world" type remains off-limits for now: the pain
is still too large (conda helps, but there are lots of people
who don't use conda!), and there aren't really any compelling
improvements that this would enable anyway.
- For the "break-0.1%-of-users" type, it is *not* ruled out by
fiat, though we remain conservative: we should treat it like
other API breaks in principle, and do a careful case-by-case
analysis of the details of the situation, taking into account
what kind of code would be broken, how common these cases are,
how important the benefits are, whether there are any specific
mitigation strategies we can use, etc. -- with this process of
course taking into account that a segfault is nastier than a
Other points that were discussed
- There was inconclusive discussion of what we should do with dot()
in the places where it disagrees with the PEP 465 matmul semantics
(specifically this is when both arguments have ndim >= 3, or one
argument has ndim == 0).
- The concern is that the current behavior is not very useful, and
as far as we can tell no-one is using it; but, as people get
used to the more-useful PEP 465 behavior, they will increasingly
try to use it on the assumption that np.dot will work the same
way, and this will create pain for lots of people. So Nathaniel
argued that we should start at least issuing a visible warning
when people invoke the corner-case behavior.
- But OTOH, np.dot is such a core piece of infrastructure, and
there's such a large landscape of code out there using numpy
that we can't see, that others were reasonably wary of making
- For now: document prominently, but no change in behavior.
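For readers who have not run into the corner cases, a short sketch of where np.dot and np.matmul disagree. It assumes a NumPy version that has np.matmul (1.10 or later); the array shapes are arbitrary examples.

```python
import numpy as np

a = np.ones((2, 3, 4))
b = np.ones((2, 4, 5))

# dot contracts the last axis of `a` with the second-to-last axis of `b`
# and keeps every other axis of both operands.
dot_shape = np.dot(a, b).shape

# matmul treats the leading axes as a batch of matrices and broadcasts them.
matmul_shape = np.matmul(a, b).shape

# With an ndim == 0 argument, dot silently falls back to multiplication,
# while matmul refuses scalars.
scalar_dot = np.dot(3, a)
try:
    np.matmul(3, a)
    matmul_accepts_scalar = True
except (ValueError, TypeError):
    matmul_accepts_scalar = False
```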
Links to raw notes
Notes from the meeting proper:
Slides from the followup BoF:
Notes from the followup BoF:
Nathaniel J. Smith -- http://vorpus.org
There's been some work going on recently on Py2 vs Py3 object comparisons.
If you want all the background, see gh-6265
<https://github.com/numpy/numpy/issues/6265> and follow the links there.
There is a half baked PR in the works, gh-6269
<https://github.com/numpy/numpy/pull/6269>, that tries to unify behavior
and fix some bugs along the way, by replacing all 2.x uses of
PyObject_Compare with several calls to PyObject_RichCompareBool, which is
available on 2.6, the oldest Python version we support.
The poster child for this example is computing np.sign on an object array
that has an np.nan entry. 2.x will just make up an answer for us:
>>> cmp(np.nan, 0)
even though none of the relevant compares succeeds:
>>> np.nan < 0
>>> np.nan > 0
>>> np.nan == 0
The current 3.x is buggy, so the fact that it produces the same made up
result as in 2.x is accidental:
>>> np.sign(np.array([np.nan], 'O'))
Looking at the code, it seems that the original intention was for the
answer to be `0`, which is equally made up but perhaps makes a little more
There are three ways of fixing this that I see:
1. Arbitrarily choose a value to set the return to. This is equivalent
to choosing a default return for `cmp` for comparisons. This preserves
behavior, but feels wrong.
2. Similarly to how np.sign of a floating point array with nans returns
nan for those values, return e.g. None for these cases. This is my
3. Raise an error, along the lines of the TypeError: unorderable types
that 3.x produces for some comparisons.
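A pure-Python sketch of option 2, to make the behaviour concrete. three_way is a hypothetical helper that mirrors the "several rich comparisons instead of one cmp" approach described above; it is not the code in the PR.

```python
def three_way(a, b):
    """Return -1, 0 or 1, or None when none of <, >, == holds."""
    if a < b:
        return -1
    if a > b:
        return 1
    if a == b:
        return 0
    # Unordered (e.g. nan): do not make up an answer.
    return None


nan = float("nan")
sign_of_nan = three_way(nan, 0)
sign_of_negative = three_way(-2, 0)
sign_of_zero = three_way(0, 0)
sign_of_positive = three_way(5, 0)
```

Option 3 would instead let the error propagate: on Python 3, comparing unorderable types with < already raises TypeError inside a helper like this one.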
( > <) This is Bunny. Copy Bunny into your signature and help him with his
plans for world domination.
My search engine was not able to help me on this one, possibly because I
don't know exactly *what* I am looking for.
I need to override __getitem__ for a class that wraps a numpy array. I
know the dimensions of my array (which can be variable from instance to
instance), and I know what I want to do: for one preselected dimension,
I need to select a different slice than the one requested by the user, do something
with the data, and return the variable.
I am looking for a function that helps me to "clean" the input of
__getitem__. There are so many possible cases, when the user uses [:] or
[..., 1:2] or [0, ..., :] and so forth. But all these cases have an
equivalent index array of len(ndimensions) with only valid slice()
objects in it. This array would be much easier for me to work with.
in pseudo code:
    def __getitem__(self, item):
        # clean input
        item = np.clean_item(item, ndimensions=4)
        # Ok now item is guaranteed to be of len 4
        item = slice()
Is there such a function in numpy?
I hope I have been clear enough... Thanks a lot!
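I am not aware of a public NumPy function that does exactly this, so here is a minimal sketch of such a helper. normalize_index is a hypothetical name. It assumes basic indexing only (integers, slices and a single Ellipsis), and it turns an integer into a length-1 slice, so the indexed dimension is kept rather than dropped.

```python
import operator


def normalize_index(item, ndim):
    """Expand a __getitem__ argument into a tuple of exactly `ndim` slices."""
    if not isinstance(item, tuple):
        item = (item,)
    # Identity test, so array-like entries are never compared with ==.
    n_ellipsis = sum(1 for it in item if it is Ellipsis)
    if n_ellipsis > 1:
        raise IndexError("an index can only have a single Ellipsis")
    n_explicit = len(item) - n_ellipsis
    if n_explicit > ndim:
        raise IndexError("too many indices")

    out = []
    for it in item:
        if it is Ellipsis:
            # The Ellipsis stands for all dimensions not given explicitly.
            out.extend([slice(None)] * (ndim - n_explicit))
        elif isinstance(it, slice):
            out.append(it)
        else:
            i = operator.index(it)  # TypeError for None, floats, arrays...
            # i == -1 needs an open end, since slice(-1, 0) would be empty.
            out.append(slice(i, i + 1 or None))
    # Without an Ellipsis, missing trailing dimensions are taken whole.
    out.extend([slice(None)] * (ndim - len(out)))
    return tuple(out)


full = normalize_index(slice(None), 4)
tail = normalize_index((Ellipsis, slice(1, 2)), 4)
mixed = normalize_index((0, Ellipsis, slice(None)), 4)
last = normalize_index(-1, 2)
```

Inside __getitem__ the result can be converted to a list, the entry for the preselected dimension replaced, and the tuple passed on to the wrapped array. np.newaxis, boolean masks and integer arrays are deliberately rejected here.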
We've found that NumPy uses the local TZ for printing datetime64 timestamps:
In : t = datetime.utcnow()
In : print t
In : np.array([t], dtype="datetime64[s]")
Out: array(['2015-08-26T13:52:10+0200'], dtype='datetime64[s]')
Googling for a way to print UTC out of the box, the best thing I could find
In : [str(i.item()) for i in np.array([t], dtype="datetime64[s]")]
Out: ['2015-08-26 11:52:10']
Now, is there a better way to specify that I want the datetimes printed
always in UTC?
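One possible answer, as a sketch: np.datetime_as_string takes a timezone argument, and in the NumPy versions I am familiar with, passing "UTC" gives ISO 8601 strings with a trailing Z. The timestamp below is just the example value from the question.

```python
import numpy as np

stamps = np.array(["2015-08-26T11:52:10"], dtype="datetime64[s]")

# Format explicitly as UTC instead of relying on the array repr.
utc_text = np.datetime_as_string(stamps, timezone="UTC")
first = str(utc_text[0])
```

As far as I know, NumPy 1.11 and later also stopped applying the local timezone when printing datetime64 values, so on those versions the repr itself no longer shows a local offset.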
I am a numpy newbie.
I have two wav files, and for one of them numpy takes a very long time to compute the FFT. They were created in Audacity using white noise, with silence for the gaps.
The files are very similar in the following way;
1. is white noise with silence gaps on every 15 second interval.
2. is 1. but slightly shorter, i.e. I trimmed some ms off the end but it still has the last gap at 60s.
The code I am using processes the file like this;
    import numpy as np
    import scipy.io.wavfile

    framerate, data = scipy.io.wavfile.read(filepath)
    right = data[:, 0]
    # Align it to be efficient: drop the last sample if the length is odd.
    if len(right) % 2 != 0:
        right = right[:-1]
    noframes = len(right)
    fftout = np.fft.fft(right) / noframes  # <<< I am timing this cmd
my_1_minute_noise_with_gaps_truncated took 30.75620985s to process.
my_1_minute_noise_with_gaps took 22307.13917s to process.
Could someone tell me why this behaviour is happening please?
Sorry I can't attach the files as this email gets bounced, but you could easily create the files yourself. E.g. my last gap runs from 59.9995 to 1:00.0005, and I repeat this every 15 seconds. My truncated file is 1:00.0015s long, the non-truncated one is 1:00.0833s long.
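A likely explanation, offered as a sketch rather than a diagnosis of these particular files: FFT cost depends heavily on how the input length factorizes, and lengths with a large prime factor were very slow in the FFT code NumPy used at the time. The helpers below are hypothetical names, the prime length 10007 is only an example, and zero-padding via the n argument of np.fft.fft is one possible workaround.

```python
import numpy as np


def largest_prime_factor(n):
    """Largest prime factor of n (n >= 2), by trial division."""
    largest = 1
    p = 2
    while p * p <= n:
        while n % p == 0:
            largest = p
            n //= p
        p += 1
    # Whatever is left above 1 is itself a prime factor.
    return max(largest, n) if n > 1 else largest


def next_power_of_two(n):
    """Smallest power of two that is >= n."""
    return 1 << (n - 1).bit_length()


signal = np.random.default_rng(0).standard_normal(10007)  # prime length
worst_factor = largest_prime_factor(len(signal))

# Zero-pad to a length made only of small factors before transforming.
padded_len = next_power_of_two(len(signal))
spectrum = np.fft.fft(signal, n=padded_len) / len(signal)
```

Checking largest_prime_factor(len(right)) for both files would show whether this is what is happening. Padding changes the frequency bin spacing, so trimming to a nearby length with small factors is an alternative if the exact bins matter.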
Reading Nathaniel's summary from the numpy dev meeting, it looks like there
is a consensus on using cython in numpy for the Python-C interfaces.
This has been on my radar for a long time: that was one of my reasons for
splitting multiarray into multiple "independent" .c files half a decade
ago. I took the opportunity of EuroScipy sprints to look back into this,
but before looking more into it, I'd like to make sure I am not going
1. The transition has to be gradual
2. The obvious way I can think of allowing cython in multiarray is
modifying multiarray such that cython "owns" the PyMODINIT_FUNC and the
module PyModuleDef table.
3. We start using cython for the parts that are mostly menial refcount
work. Things like functions in calculation.c are obvious candidates.
Step 2 should not be disruptive, and does not look like a lot of work:
there are < 60 methods in the table, and most of them should be fairly
straightforward to cythonize. At worst, we could just keep them as is
outside cython and just "export" them in cython.
Does that sound like an acceptable plan?
If so, I will start working on a PR for step 2.