[Numpy-discussion] Advanced indexing: "fancy" vs. orthogonal

Nathaniel Smith njs at pobox.com
Sat Apr 4 04:54:33 EDT 2015


On Sat, Apr 4, 2015 at 12:17 AM, Ralf Gommers <ralf.gommers at gmail.com> wrote:
>
>
> On Sat, Apr 4, 2015 at 1:54 AM, Nathaniel Smith <njs at pobox.com> wrote:
>>
>>
>> But, the real problem here is that we have two different array duck
>> types that force everyone to write their code twice. This is a
>> terrible state of affairs! (And exactly analogous to the problems
>> caused by np.ndarray disagreeing with np.matrix & scipy.sparse about
>> the proper definition of *, which PEP 465 may eventually
>> alleviate.) IMO we should be solving this indexing problem directly,
>> not applying bandaids to its symptoms, and the way to do that is to
>> come up with some common duck type that everyone can agree on.
>>
>> Unfortunately, AFAICT this means our only options here are to have
>> some kind of backcompat break in numpy, some kind of backcompat break
>> in pandas, or to do nothing and continue indefinitely with the status
>> quo where the same indexing operation might silently return different
>> results depending on the types passed in. All of these options have
>> real costs for users, and it isn't at all clear to me what the
>> relative costs will be when we dig into the details of our various
>> options.
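
To make that divergence concrete before going further: fancy indexing
pairs the index lists up element by element, while orthogonal indexing
takes their cross product. A minimal sketch in plain numpy, where
np.ix_ spells the orthogonal behavior:

    import numpy as np

    a = np.arange(12).reshape(3, 4)

    # Fancy indexing zips the index lists: this picks a[0, 1] and a[2, 3].
    a[[0, 2], [1, 3]]           # -> array([ 1, 11]), shape (2,)

    # Orthogonal indexing crosses them; numpy spells that with np.ix_.
    a[np.ix_([0, 2], [1, 3])]   # -> array([[ 1,  3],
                                #           [ 9, 11]]), shape (2, 2)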
>
>
> I doubt that there is a reasonable way to quantify those costs, especially
> those of breaking backwards compatibility. If someone has a good method, I'd
> be interested though.

I'm a little nervous about how easily this argument might turn into
"either A or B is better but we can't be 100% *certain* which it is so
instead of doing our best using the data available we should just
choose B." Being a maintainer means accepting uncertainty and doing
our best anyway.

But that said, I'm still totally on board with erring on the side of
caution (in particular, you can never go back and *un*break
backcompat). An obvious challenge to anyone trying to take this
forward (in any direction!) would definitely be to gather the most
useful data possible. And it's not obviously impossible -- maybe one
could do something useful by scanning the ASTs of lots of packages (I
have a copy of PyPI that I downloaded with the idea of making some
similar arguments for why core python should slightly break backcompat
to allow overloading of a < b < c syntax; it's available if anyone
wants it), or adding
instrumentation to numpy, or running small-scale usability tests, or
surveying people, or ...
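
To give a flavor of the AST idea, here's a rough, hypothetical sketch
(the function name and the heuristic are mine, and it's purely
syntactic -- it can't tell an ndarray from a dict, so a real survey
would need more care):

    import ast

    def count_fancyish_subscripts(source):
        # Count subscripts like a[[0, 2], [1, 3]]: a tuple index
        # containing list literals, i.e. the pattern whose meaning the
        # fancy-vs-orthogonal question turns on.
        hits = 0
        for node in ast.walk(ast.parse(source)):
            if not isinstance(node, ast.Subscript):
                continue
            idx = node.slice
            idx = getattr(idx, "value", idx)  # unwrap ast.Index on old Pythons
            if isinstance(idx, ast.Tuple) and any(
                    isinstance(e, ast.List) for e in idx.elts):
                hits += 1
        return hits

    print(count_fancyish_subscripts("b = a[[0, 2], [1, 3]]"))  # -> 1

Counts like that, run over a pypi snapshot, wouldn't settle the
question, but they'd at least bound how much code could be affected.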

(I was pretty surprised by some of the data gathered during the PEP
465 process, e.g. on how common dot() calls are relative to existing
built-in operators, and on its associativity in practice.)

>>
>> So I'd be very happy to see worked out proposals for any or
>> all of these approaches. It strikes me as really premature to be
>> issuing proclamations about what changes might be considered. There is
>> really no danger in *considering* a proposal;
>
>
> Sorry, I have to disagree. Numpy is already seen by some as having a poor
> track record on backwards compatibility. Having core developers say "propose
> some backcompat break to how indexing works and we'll consider it" makes our
> stance on that look even worse. Of course everyone is free to make any
> technical proposal they deem fit and we'll consider the merits of it.
> However I'd like us to be clear that we do care strongly about backwards
> compatibility and that the fundamentals of the core of Numpy (things like
> indexing, broadcasting, dtypes and ufuncs) will not be changed in
> backwards-incompatible ways.
>
> Ralf
>
> P.S. also not for a possible numpy 2.0 (or have we learned nothing from
> Python3?).

I agree 100% that we should and do care strongly about backwards
compatibility. But you're saying in one sentence that we should tell
people that we won't consider backcompat breaks, and then in the next
sentence that of course we actually will consider them (even if we
almost always reject them). Basically, I think saying one thing and
doing another is not a good way to build people's trust.

Core python broke backcompat on a regular basis throughout the python
2 series, and almost certainly will again -- the bar to doing so is
*very* high, and they use elaborate mechanisms to ease the way
(__future__, etc.), but they do it. A few months ago there was even
some serious consideration given to changing py3 bytestring indexing
to return bytestrings instead of integers. (Consensus was
unsurprisingly that this was a bad idea, but there were core devs
seriously exploring it, and no-one complained about the optics.)
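
The __future__ mechanism is the key piece there: it lets a module opt
in to the new semantics long before they become the default, which is
probably the model for any numpy transition too. E.g., for the PEP 238
division change:

    # Python 2: opt in, per module, to the division semantics that
    # later became the default in Python 3.
    from __future__ import division

    print(7 / 2)   # 3.5 under true division (was 3 in classic Python 2)
    print(7 // 2)  # 3; floor division got its own operator in the same change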

It's true that numpy has something of a bad reputation in this area,
and I think it's because until ~1.7 or so, we randomly broke stuff by
accident on a pretty regular basis, even in "bug fix" releases. I
think the way to rebuild that trust is to honestly say to our users
that when we do break backcompat, we will never do it by accident, and
we will do it only rarely, after careful consideration, with the
smoothest transition possible, only in situations where we are
convinced that it is the net best possible solution for our users, and
only after public discussion and getting buy-in from stakeholders
(e.g. major projects affected). And then follow through on that to the
best of our ability. We've certainly gotten a lot better at this over
the last few years.

If we say we'll *never* break backcompat then we'll inevitably end up
convincing some people that we're liars, just because one person's
bugfix is another's backcompat break. (And they're right, it is a
backcompat break; it's just one where the benefits of the fix
obviously outweigh the cost of the break.) Or we could actually avoid
breaking backcompat by descending into Knuth-style stasis... but even
there, notice that none of us are actually using Knuth's TeX; we all
use forks like XeTeX that have further changes added, which goes to
show how futile this would be.

In particular, I'd *not* willingly say that we'll never incompatibly
change the core pieces of numpy, b/c I'm personally convinced that
rewriting how e.g. dtypes work could be a huge win with minimal
real-world breakage -- even though technically there's practically
nothing we can touch there without breaking backcompat to some extent
b/c dtype structs are all public, including even silly things like the
ad hoc, barely-used refcounting system. OTOH I'm happy to say that we
won't incompatibly change the core of how dtypes work except in ways
that make the userbase glad that we did. How's that? :-)

-n

-- 
Nathaniel J. Smith -- http://vorpus.org


