[Numpy-discussion] consensus (was: NA masks in the next numpy release?)

Fri Oct 28 17:32:20 EDT 2011

Hi,

On Fri, Oct 28, 2011 at 2:16 PM, Nathaniel Smith <njs at pobox.com> wrote:
> On Tue, Oct 25, 2011 at 2:56 PM, Travis Oliphant <oliphant at enthought.com> wrote:
>> I think Nathaniel and Matthew provided very
>> specific feedback that was helpful in understanding other perspectives of a
>> difficult problem. In particular, I really wanted bit-patterns
>> implemented. However, I also understand that Mark did quite a bit of work
>> and altered his original designs quite a bit in response to community
>> feedback. I wasn't a major part of the pull request discussion, nor did I
>> merge the changes, but I support Charles if he reviewed the code and felt
>> like it was the right thing to do.  I likely would have done the same thing
>> rather than let Mark Wiebe's work languish.
>
> My connectivity is spotty this week, so I'll stay out of the technical
> discussion for now, but I want to share a story.
>
> Maybe a year ago now, Jonathan Taylor and I were debating what the
> best API for describing statistical models would be -- whether we
> wanted something like R's "formulas" (which I supported), or another
> approach based on sympy (his idea). To summarize, I thought his API
> was confusing, pointlessly complicated, and didn't actually solve the
> problem; he thought R-style formulas were superficially simpler but
> hopelessly confused and inconsistent underneath. Now, obviously, I was
> right and he was wrong. Well, obvious to me, anyway... ;-) But it
> wasn't like I could just wave a wand and make his arguments go away,
> no matter how annoying and wrong-headed I thought they were... I could
> write all the code I wanted but no-one would use it unless I could
> convince them it's actually the right solution, so I had to engage
> with him, and dig deep into his arguments.
>
> What I discovered was that (as I thought) R-style formulas *do* have a
> solid theoretical basis -- but (as he thought) all the existing
> implementations *are* broken and inconsistent! I'm still not sure I
> can actually convince Jonathan to go my way, but, because of his
> stubbornness, I had to invent a better way of handling these formulas,
> and so my library[1] is actually the first implementation of these
> things that has a rigorous theory behind it, and in the process it
> avoids two fundamental, decades-old bugs in R. (And I'm not sure the R
> folks can fix either of them at this point without breaking a ton of
> code, since they both have API consequences.)
>
> --
>
> It's extremely common for healthy FOSS projects to insist on consensus
> for almost all decisions, where consensus means something like "every
> interested party has a veto"[2]. This seems counterintuitive, because
> if everyone's vetoing all the time, how does anything get done? The
> trick is that if anyone *can* veto, then vetoes turn out to actually
> be very rare. Everyone knows that they can't just ignore alternative
> points of view -- they have to engage with them if they want to get
> anything done. So you get buy-in on features early, and no vetoes are
> necessary. And by forcing people to engage with each other, like me
> with Jonathan, you get better designs.
>
> But what about the cost of all that code that doesn't get merged, or
> written, because everyone's spending all this time debating instead?
> Better designs are nice and all, but how does that justify letting
> working code languish?
>
> The greatest risk for a FOSS project is that people will ignore you.
> Projects and features live and die by community buy-in. Consider the
> "NA mask" feature right now. It works (at least the parts of it that
> are implemented). It's in mainline. But IIRC, Pierre said last time
> that he doesn't think the current design will help him improve or
> replace numpy.ma. Up-thread, Wes McKinney is leaning towards ignoring
> this feature in favor of his library pandas' current hacky NA support.
> Members of the neuroimaging crowd are saying that the memory overhead
> is too high and the benefits too marginal, so they'll stick with NaNs.
> Together these folk are a huge proportion of this feature's target
> audience. So what have we actually accomplished by merging this to
> mainline? Are we going to be stuck supporting a feature that only a
> fraction of the target audience actually uses? (Maybe they're being
> dumb, but if people are ignoring your code for dumb reasons... they're
> still ignoring your code.)
>
> The consensus rule forces everyone to do the hardest and riskiest part
> -- building buy-in -- up front. Because you *have* to do it sooner or
> later, and doing it sooner doesn't just generate better designs. It
> drastically reduces the risk of ending up in a huge trainwreck.
>
> --
>
> In my story at the beginning, I wished I had a magic wand to skip this
> annoying debate and political stuff. But giving it to me would have
> been a bad idea. I think that's what went wrong with the NA discussion in
> the first place. Mark's an excellent programmer, and he tried his best
> to act in the good of everyone in the project -- but in the end, he
> did have a wand like that. He didn't have that sense that he *had* to
> get everyone on board (even the people who were saying dumb things),
> or he'd just be wasting his time. He didn't ask Pierre if the NA
> design would actually work for numpy.ma's purposes -- I did.
>
> You may have noticed that I do have some ideas about how NA
> support should work. But my ideas aren't really the important thing.
> The alter-NEP was my attempt to find common ground between the
> different needs people were bringing up, so we could discuss whether
> it would work for people or not. I'm not wedded to anything in it. But
> this is a complicated issue with a lot of conflicting interests, and
> we need to find something that actually does work for everyone (or as
> large a subset as is practical).
>
> So here's what I think we should do:
>  1) I will submit a pull request backing Mark's NA work out of
> mainline, for now. (This is more or less done, I just need to get it
> onto github, see above re: connectivity)
>  2) I will also put together a new branch containing that work,
> rebased against current mainline, so it doesn't get lost. (Ditto.)
>  3) And we'll decide what to do with it *after* we hammer out a
> design that the various NA-supporting groups all find convincing. Or
> at least a design for some of the less controversial pieces (like the
> 'where=' ufunc argument?), get those merged, and then iterate
> incrementally.
>
> What do you all think?

Nice post - thank you.

I agree that we may have a problem with - process.  I mean, maybe
there is not much agreement on what the process for these kinds of
discussions should be - and therefore - we can't point to some
constitution or similar to say - hey - wait - we're not doing it
right.

It seems to me - from my technical reply to Travis - that it would be
reasonable to keep Mark's implementation of masked arrays, but with
some minor modifications to keep IGNORED (implemented) conceptually
separable from ABSENT (not implemented). Maybe the discussion
could be about those modifications?  Specifically, where do you feel
the points of disagreement are, after the masking idea has become
clearly an implementation of IGNORED?
I guess you also don't much care if the IGNORED default behavior is
PROPAGATE or SKIP.
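
To make the PROPAGATE / SKIP distinction concrete, here is a minimal
sketch in plain numpy. It does not use Mark's NA-mask API; the
function name masked_sum, its skip flag, and the use of None to stand
in for a missing result are all made up for illustration. The only
real library behavior it relies on is ordinary boolean indexing and
the fact that numpy.ma reductions leave masked elements out.

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0])
ignored = np.array([False, True, False, False])  # True = element is IGNORED


def masked_sum(values, ignored, skip):
    """Sum `values`, treating elements flagged in `ignored` as IGNORED.

    skip=True  -> SKIP: flagged elements are left out of the reduction.
    skip=False -> PROPAGATE: any flagged element makes the whole result
                  missing (None here is a hypothetical stand-in for NA).
    """
    if skip:
        return float(values[~ignored].sum())
    if ignored.any():
        return None
    return float(values.sum())


skip_result = masked_sum(values, ignored, skip=True)        # 1 + 3 + 4
propagate_result = masked_sum(values, ignored, skip=False)  # missing

# numpy.ma reductions behave like SKIP: masked elements are left out.
ma_result = float(np.ma.array(values, mask=ignored).sum())

# IGNORED is reversible: the underlying value is still stored, so
# clearing the flag brings it back into the computation.
ignored[1] = False
unmasked_result = masked_sum(values, ignored, skip=False)   # 1 + 2 + 3 + 4
```

The last two lines are the part that would not carry over to ABSENT,
where there is no underlying value to recover.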

I had thought about what would happen to numpy.ma - and I would really
like to know what Pierre would need for this implementation to allow
him to replace numpy.ma.

See you,

Matthew