[Numpy-discussion] What is consensus anyway

Tue Apr 24 20:56:27 EDT 2012

On Tue, Apr 24, 2012 at 2:14 PM, Charles R Harris
<charlesr.harris at gmail.com> wrote:
>
>
> On Mon, Apr 23, 2012 at 11:35 PM, Fernando Perez <fperez.net at gmail.com>
> wrote:
>>
>> On Mon, Apr 23, 2012 at 8:49 PM, Stéfan van der Walt <stefan at sun.ac.za>
>> wrote:
>> > If you are referring to the traditional concept of a fork, and not to
>> > the type we frequently make on GitHub, then I'm surprised that no one
>> > has objected already.  What would a fork solve? To paraphrase the
>> > regexp saying: after forking, we'll simply have two problems.
>>
>> I concur with you here: GitHub 'forks', yes, as many as possible!
>> Hopefully every one of those will produce one or more PRs :)  But a
>> fork in the sense of a divergent parallel project?  I think that would
>> only be indicative of a complete failure to find a way to make
>> progress here, and I doubt we're anywhere near that state.
>>
>> That forks are *possible* is indeed a valuable and important option in
>> open source software, because it means that a truly dysfunctional
>> original project team/direction can't hold a community hostage
>> forever.  But that doesn't mean that full-blown forks should be
>> considered lightly, as they also carry enormous costs.
>>
>> I see absolutely nothing in the current scenario to even remotely
>> consider that a full-blown fork would be a good idea, and I hope I'm
>> right.  It seems to me we're making progress on problems that led to
>> real difficulties last year, and from multiple parties I see signs
>> that give me reason to be optimistic that the project is getting
>> better, not worse.
>>
>
> We certainly aren't there at the moment, but I can see us heading that way.
> But let's back up a bit. Numpy 1.6.0 came out just about 1 year ago. Since
> then datetime, NA, polynomial work, and various other enhancements have gone
> in along with some 280 bug fixes. The major technical problem blocking a 1.7
> release is getting datetime working reliably on Windows. So I think that is
> where the short term effort needs to be. Meanwhile, we are spending effort
> to get out a 1.6.2 just so people can work with a stable version with some
> of the bug fixes, and potentially we will spend more time and effort to pull
> out the NA code. In the future there may be a transition to C++ and
> eventually a break with the current ABI. Or not.
>
> There are at least two motivations that get folks to write code for open
> source projects, scratching an itch and money. Money hasn't been a big part
> of the Numpy picture so far, so that leaves scratching an itch. One of the
> attractions of Numpy is that it is a small project, BSD licensed, and not
> overburdened with governance and process. This makes scratching an itch not
> as difficult as it would be in a large project. If Numpy remains a small
> project but acquires the encumbrances of a big project much of that
> attraction will be lost. Momentum and direction also attract people, but
> numpy is stalled at the moment as the whole NA thing circles around once
> again.

I don't think we need a fork, or to start maintaining separate stable
and unstable trees, or any of the other complicated process changes
that have been suggested. There are tons of projects that routinely
make much bigger changes than we're talking about, and they do it
without needing that kind of overhead. I know that these suggestions
are all made in good faith, but they remind me of a line from that
Apache page I linked earlier: "People tend to avoid conflict and
thrash around looking for something to substitute - somebody in
charge, a rule, a process, stagnation. None of these tend to be very
good substitutes for doing the hard work of resolving the conflict."

I also think if you talk to potential contributors, you'll find that
clear, simple processes and a history of respecting everyone's input
are much more attractive than a no-rules free-for-all. Good
engineering practices are not an "encumbrance". Resolving conflicts
before merging is a good engineering practice.

What happened with the NA discussion is this:
  - There was substantial disagreement about whether NEP-style masks,
or indeed, focusing on a mask-based implementation *at all*, was the
best way forward.
  - There was also a perceived time constraint, that we had to either
implement something immediately while Mark was there, or have nothing.

So in the end, the latter concern outweighed the former, the
discussion was cut off, and Mark's best guess at an API was merged
into master. I totally understand how this decision made sense at the
time, but the result is what we see now: it's left numpy stalled, with
rifts on the mailing list, boring discussions about process, and still
no agreement about whether NEP-style masks will actually solve our
users' problems.

Getting past this isn't *complicated* -- it's just "hard work".

> What would I suggest as a way forward with the NA option? Let's take the
> issues.
>
> 1) Adding slots to PyArrayObject_fields. I don't think this is likely to be
> a problem unless someone's code passes the struct by value or uses
> assignment to initialize a statically allocated instance. I'm not saying no
> one does that; low-level scientific code can contain all sorts of bizarre
> and astonishing constructs, and it is also possible that these sorts of things
> might turn up in an old FORTRAN program. The question here is whether to
> allow any changes at all, and I think we will have to in the future. Given
> that, consistent use of accessors will make later changes to the
> organization or implementation of the base structure transparent. Numpy
> itself now uses accessors for the heritage slots, but not for the new NA
> slots. So I suggest at a minimum adding accessors for the maskna_dtype,
> maskna_data, and maskna_strides. Of course, later removing these slots will
> still remain a problem.
>
> 2) NA. This breaks down into API and implementation issues. Personally, I
> think marking the NA stuff experimental leaves room to modify both and would
> prefer to go with what we have and change it into whatever looks best by
> modification through pull requests. This kicks the can down the road, but
> not so far that people sufficiently interested in working on the topic can't
> get modifications in. My own preferences for future API modifications are as
> follows.
>
> a) All arrays should be implicitly masked, even if the mask isn't initially
> allocated. The maskna keyword can then be removed, taking with it the sense
> that there are two kinds of arrays.
>
> b) There needs to be a distinction between missing and ignore. The mechanism
> for this is already in place in the payload type, although it isn't clear to
> me that that is uniformly used in all the NA code. There is also a place for
> missing *and* ignored. Which leads to
>
> c) Sums, etc. should always skip ignored data. If missing data is present,
> but not ignored, then a sum should return NA. The main danger I see here is
> that the behavior of arrays becomes state dependent, something that can lead
> to subtle problems. Explicit request for a particular behavior, as is done
> now, might be preferable for its clarity.
>
> d) I think views are a good way to add another mask layer to existing arrays.
>
> And for implementation:
>
> a) Ufunc loop support. This is most easily done with explicit masks.
>
> b) Apropos a), I'm coming (again) to the opinion that byte masks are the
> simplest and most general implementation.
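
To make the semantics in API point c) and the byte-mask idea in
implementation point b) concrete, here is a minimal sketch in plain
Python. It is an illustration only, not NumPy code: the flag names
(VALID, MISSING, IGNORED), the use of None as a stand-in for NA, and
the function masked_sum are all hypothetical and chosen for this
example. It assumes one mask byte per element, with separate bits for
"missing" and "ignored" so that an element can be both.

```python
# Hypothetical byte-mask flags; illustrative names, not NumPy API.
VALID = 0      # ordinary value
MISSING = 1    # value is unknown (NA)
IGNORED = 2    # value is skipped by reductions
NA = None      # stand-in for an NA result


def masked_sum(values, mask):
    """Sum values, skipping ignored elements and propagating missing ones."""
    if len(values) != len(mask):
        raise ValueError("values and mask must have the same length")
    total = 0
    for value, flags in zip(values, mask):
        if flags & IGNORED:
            continue      # ignored elements never contribute, even if also missing
        if flags & MISSING:
            return NA     # a missing element that is not ignored makes the sum NA
        total += value
    return total


data = [1.0, 2.0, 3.0, 4.0]
print(masked_sum(data, bytes([VALID, VALID, VALID, VALID])))
print(masked_sum(data, bytes([VALID, IGNORED, VALID, VALID])))
print(masked_sum(data, bytes([VALID, MISSING, VALID, VALID])))
print(masked_sum(data, bytes([VALID, MISSING | IGNORED, VALID, VALID])))
```

Note that the result depends on the state of the mask rather than on
an explicit argument at the call site, which is the state dependence
that point c) flags as a possible source of subtle problems.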

Unfortunately, I think that there are more fundamental disagreements
to address before we worry about these questions. Even more
unfortunately, I've just spent a bunch of time trying to articulate
what those are, but it's in a draft of this summary Mark and I are
working on, which I can't really share until he's looked at it... so, I
don't want to ignore your attempts to move forward, but can I ask you
to look for my response in a day or two and in another thread? :-)

-- Nathaniel