[Numpy-discussion] consensus (was: NA masks in the next numpy release?)

Sat Oct 29 05:22:17 EDT 2011

On Sat, Oct 29, 2011 at 3:32 AM, Charles R Harris <charlesr.harris at gmail.com> wrote:

>
>
> On Fri, Oct 28, 2011 at 6:45 PM, Wes McKinney <wesmckinn at gmail.com> wrote:
>
>> On Fri, Oct 28, 2011 at 7:53 PM, Benjamin Root <ben.root at ou.edu> wrote:
>> >
>> >
>> > On Friday, October 28, 2011, Matthew Brett <matthew.brett at gmail.com> wrote:
>> >> Hi,
>> >>
>> >> On Fri, Oct 28, 2011 at 4:21 PM, Ralf Gommers
>> >> <ralf.gommers at googlemail.com> wrote:
>> >>>
>> >>>
>> >>> On Sat, Oct 29, 2011 at 12:37 AM, Matthew Brett <matthew.brett at gmail.com> wrote:
>> >>>>
>> >>>> Hi,
>> >>>>
>> >>>> On Fri, Oct 28, 2011 at 3:14 PM, Charles R Harris
>> >>>> <charlesr.harris at gmail.com> wrote:
>> >>>> >
>> >>>> >
>> >>>> > On Fri, Oct 28, 2011 at 3:56 PM, Matthew Brett
>> >>>> > <matthew.brett at gmail.com>
>> >>>> > wrote:
>> >>>> >>
>> >>>> >> Hi,
>> >>>> >>
>> >>>> >> On Fri, Oct 28, 2011 at 2:43 PM, Matthew Brett
>> >>>> >> <matthew.brett at gmail.com>
>> >>>> >> wrote:
>> >>>> >> > Hi,
>> >>>> >> >
>> >>>> >> > On Fri, Oct 28, 2011 at 2:41 PM, Charles R Harris
>> >>>> >> > <charlesr.harris at gmail.com> wrote:
>> >>>> >> >>
>> >>>> >> >>
>> >>>> >> >> On Fri, Oct 28, 2011 at 3:16 PM, Nathaniel Smith <njs at pobox.com> wrote:
>> >>>> >> >>>
>> >>>> >> >>> On Tue, Oct 25, 2011 at 2:56 PM, Travis Oliphant
>> >>>> >> >>> <oliphant at enthought.com>
>> >>>> >> >>> wrote:
>> >>>> >> >>> > I think Nathaniel and Matthew provided very specific feedback
>> >>>> >> >>> > that was helpful in understanding other perspectives of a
>> >>>> >> >>> > difficult problem. In particular, I really wanted bit-patterns
>> >>>> >> >>> > implemented. However, I also understand that Mark did quite a
>> >>>> >> >>> > bit of work and altered his original designs quite a bit in
>> >>>> >> >>> > response to community feedback. I wasn't a major part of the
>> >>>> >> >>> > pull request discussion, nor did I merge the changes, but I
>> >>>> >> >>> > support Charles if he reviewed the code and felt like it was
>> >>>> >> >>> > the right thing to do. I likely would have done the same thing
>> >>>> >> >>> > rather than let Mark Wiebe's work languish.
>> >>>> >> >>>
>> >>>> >> >>> My connectivity is spotty this week, so I'll stay out of the
>> >>>> >> >>> technical discussion for now, but I want to share a story.
>> >>>> >> >>>
>> >>>> >> >>> Maybe a year ago now, Jonathan Taylor and I were debating what
>> >>>> >> >>> the best API for describing statistical models would be --
>> >>>> >> >>> whether we wanted something like R's "formulas" (which I
>> >>>> >> >>> supported), or another approach based on sympy (his idea). To
>> >>>> >> >>> summarize, I thought his API was confusing, pointlessly
>> >>>> >> >>> complicated, and didn't actually solve the problem; he thought
>> >>>> >> >>> R-style formulas were superficially simpler but hopelessly
>> >>>> >> >>> confused and inconsistent underneath. Now, obviously, I was
>> >>>> >> >>> right and he was wrong. Well, obvious to me, anyway... ;-) But
>> >>>> >> >>> it wasn't like I could just wave a wand and make his arguments go
>> >>>> >> >>> no I should point out that the implementation hasn't - as far
>> as
>> >>>> >> >>> I can
>> >> see - changed the discussion.  The discussion was about the API.
>> >> Implementations are useful for agreed APIs because they can point out
>> >> where the API does not make sense or cannot be implemented.  In this
>> >> case, the API Mark said he was going to implement - he did implement -
>> >> at least as far as I can see.  Again, I'm happy to be corrected.
>> >>
>> >>>> In saying that we are insisting on our way, you are saying,
>> >>>> implicitly, 'I am not going to negotiate'.
>> >>>
>> >>> That is only your interpretation. The observation that Mark
>> >>> compromised quite a bit while you didn't seems largely correct to me.
>> >>
>> >> The problem here stems from our inability to work towards agreement,
>> >> rather than standing on set positions.  I set out what changes I think
>> >> would make the current implementation OK.  Can we please, please have
>> >> a discussion about those points instead of trying to argue about who
>> >> has given more ground.
>> >>
>> >>> That commitment would of course be good. However, even if that were
>> >>> possible before writing code and everyone agreed that the ideas of
>> >>> you and Nathaniel should be implemented in full, it's still not clear
>> >>> that either of you would be willing to write any code. Agreement
>> >>> without code still doesn't help us very much.
>> >>
>> >> I'm going to return to Nathaniel's point - it is a highly valuable
>> >> thing to set ourselves the target of resolving substantial discussions
>> >> by consensus.   The route you are endorsing here is 'implementor
>> >> wins'.   We don't need to do it that way.  We're a mature sensible
>> >> bunch of adults who can talk out the issues until we agree they are
>> >> ready for implementation, and then implement.  That's all Nathaniel is
>> >> saying.  I think he's obviously right, and I'm sad that it isn't as
>> >> clear to y'all as it is to me.
>> >>
>> >> Best,
>> >>
>> >> Matthew
>> >>
>> >
>> > Everyone, can we please not do this?! I had enough of adults doing
>> > finger pointing back over the summer during the whole debt ceiling
>> > debate.  I think we can all agree that we are better than the US
>> > congress?
>> >
>> > Forget about rudeness or decision processes.
>> >
>> > I will start by saying that I am willing to separate ignore and absent,
>> > but only on the write side of things.  On read, I want a single way to
>> > identify the missing values.  I also want only a single way to perform
>> > calculations (either skip or propagate).
>> >
>> > An indicator of success would be that people stop using NaNs and magic
>> > numbers (-9999, anyone?) and we could even deprecate nansum(), or at
>> > least strongly suggest in its docs to use NA.
>>
>> Well, I haven't completely made up my mind yet, will have to do some
>> more prototyping and playing (and potentially have some of my users
>> eat the differently-flavored dogfood), but I'm really not very
>> satisfied with the API at the moment. I'm mainly worried about the
>> abstraction leaking through to pandas users (this is a pretty large
>> group of people judging by # of downloads).
>>
>> The basic position I'm in is that I'm trying to push Python into a new
>> space, namely mainstream data analysis and statistical computing, one
>> that is solidly occupied by R and other such well-known players. My
>> target users are not computer scientists. They are not going to invest
>> in understanding dtypes very deeply or the internals of ndarray. In
>> fact I've spent a great deal of effort making it so that pandas users
>> can be productive and successful while having very little
>> understanding of NumPy. Yes, I essentially "protect" my users from
>> NumPy because using it well requires a certain level of sophistication
>> that I think is unfair to demand of people. This might seem totally
>> bizarre to some of you but it is simply the state of affairs. So far I
>> have been successful because more people are using Python and pandas
>> to do things that they used to do in R. The NA concept in R is dead
>> simple and I don't see why we are incapable of also implementing
>> something that is just as dead simple. To us, the scipy elite let's
>> call us, it seems simple: "oh, just pass an extra flag to all my array
>> constructors!" But this along with the masked array concept is going
>> to have two likely outcomes:
>>
>> 1) Create a great deal more complication in my already very large codebase
>>
>> and/or
>>
>> 2) force pandas users to understand the new masked arrays after I've
>> carefully made it so they can be largely ignorant of NumPy
>>
>> The mostly-NaN-based solution I've cobbled together and tweaked over
>> the last 42 months actually *works really well*, amazingly, with
>> relatively little cost in code complexity. Having found a reasonably
>> stable equilibrium I'm extremely resistant to upset the balance.
>>
>> So I don't know. After watching these threads bounce back and forth
>> I'm frankly not all that hopeful about a solution arising that
>> actually addresses my needs.
>>
>
> But Wes, what *are* your needs? You keep saying this, but we need examples
> of how you want to operate and how numpy fails. As to dtypes, internals, and
> all that, I don't see any of that in the current implementation, unless you
> mean the maskna and skipna keywords. I believe someone on the previous
> thread mentioned a way to deal with that.
>

From the release notes I just learned that skipna is basically the same as
in R:
"R's parameter rm.na=T is spelled skipna=True in NumPy."

It provides a good summary of the current status in master:
https://github.com/numpy/numpy/blob/master/doc/release/2.0.0-notes.rst
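
To make the skip-versus-propagate distinction concrete, here is a minimal
pure-Python sketch. It is not NumPy's API: `na_sum` and its `skipna`
argument are hypothetical names chosen for illustration, and NaN stands in
for a missing value (an assumption; the NA design discussed in this thread
is a separate mechanism).

```python
import math


def na_sum(values, skipna=False):
    """Sum floats, treating NaN as the missing-value marker (an assumption)."""
    total = 0.0
    for v in values:
        if math.isnan(v):
            if skipna:
                continue         # skip: ignore the missing entry
            return math.nan      # propagate: one missing value makes the sum missing
        total += v
    return total


data = [1.0, math.nan, 3.0]
propagated = na_sum(data)             # missing value propagates to the result
skipped = na_sum(data, skipna=True)   # missing value is dropped before summing
```

With skipna left at False a single missing entry makes the whole sum
missing; with skipna=True the missing entry is dropped and the remaining
values are summed.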

Ralf