[Numpy-discussion] alterNEP - was: missing data discussion round 2

Christopher Jordan-Squire cjordan1 at uw.edu
Fri Jul 1 12:29:50 EDT 2011


This is kind of late to be jumping into the 'long thread of doom', but I've
been following most of the posts, so I figured I'd throw in my 2 cents.
I'm Mark's officemate over the summer, and we've been talking daily about
his design. I was skeptical of various details at first, but by now Mark's
largely sold me on his design. Though, FWIW, my background is largely
statistical uses of arrays rather than scientific uses, so I grok missing
data usage more naturally than masking.

On Fri, Jul 1, 2011 at 10:15 AM, Nathaniel Smith <njs at pobox.com> wrote:

> On Fri, Jul 1, 2011 at 7:09 AM, Mark Wiebe <mwwiebe at gmail.com> wrote:
> > On Fri, Jul 1, 2011 at 6:58 AM, Matthew Brett <matthew.brett at gmail.com>
> > wrote:
> >> Do you see problems with the alterNEP proposal?
> >
> > Yes, I really like my design as it stands now, and the alterNEP removes a
> > lot of the abstraction and interoperability that are in my opinion the best
> > parts. I've made more updates to the NEP based on continuing feedback, which
> > are part of the pull request I want reviews for.
> >
> >>
> >> If so, what are they?
> >
> > Mainly: Reduced interoperability, more complex implementation (leading to
> > more bugs), and an unclear theoretical model for the masked part of it.
>
> Can you give any examples of situations where one would run into this
> "reduced interoperability"? I'm not sure what it means. The only
> person who has so far spoken up as needing both masking semantics and
> NA semantics -- Gary Strangman -- has said that he strongly prefers
> the alterNEP semantics *exactly because* it makes it clear *how these
> functions will interoperate.*
>
> Can you give any examples of how the implementation would be more
> complicated? As far as I can tell there are no elements in the
> alterNEP that are not in your NEP, they mostly just expose the
> functionality differently at the top level.
>
> Do you have a clearer theoretical model for the masked part of your
> proposal? The best I've been able to extract from any of your messages
> is when you wrote "it seems to me that people wanting masked arrays
> want missing data without touching their data". But as a matter of
> English grammar, I have no idea what this means -- if you have data,
> it's not missing! It seems to me that people wanting masked data want
> to *hide* parts of their data, which seems much clearer to me and is
> the theoretical model used in the alterNEP. Note that this model
> actually predicts several of the differences between how people want
> masks to work and how people want NAs to work (e.g., their behavior
> during reduction); I
>
>
I looked over the theoretical model in the aNEP, and I disagree with it. I
think a masked array is just that: an array with a mask. Do whatever with
the mask, but it's up to the user to decide how they want to use it. It
doesn't seem like it has to come with a theoretical model. (Unlike missing
data, which does have a nice theoretical model.)

The theoretical model in the aNEP seems to assume too much. I'm thinking in
particular of this idea: "a length-4 array in which the last value has been
masked out behaves just like an ordinary length-3 array, so long as you
don't change the mask." That's forcing a notion of column/position
independence on the masked array, in that any function operating on the rows
must treat each column the same. And I don't think that's part of the
contract that should come from creating a masked array.
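
To make the two readings concrete, here is a minimal sketch using the
existing numpy.ma module as a stand-in. It is not the API from either the
NEP or the aNEP, and the variable names are only illustrative. Reductions
over the masked array agree with the length-3 array, but the masked array
keeps its length-4 shape and positions, and a NaN standing in for an NA
propagates through the reduction instead of being skipped:

```python
import numpy as np
import numpy.ma as ma

# Length-4 array with the last value masked out (numpy.ma as a stand-in).
masked = ma.masked_array([1.0, 2.0, 3.0, 10.0],
                         mask=[False, False, False, True])
plain = np.array([1.0, 2.0, 3.0])

# Reductions skip the masked value, so they match the length-3 array.
masked_sum = masked.sum()
plain_sum = plain.sum()

# The masked array is still a length-4 object; dropping the masked
# entry is a separate, explicit step.
masked_shape = masked.shape
compressed = masked.compressed()

# NA-style semantics, imitated here with NaN: the unknown value
# propagates through the reduction rather than being skipped.
na_style = np.array([1.0, 2.0, 3.0, np.nan])
na_sum = na_style.sum()
```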


> >> Do you agree that the alterNEP proposal is easier to understand?
> >
> > No.
> >>
> >> If not, can you explain why?
> >
> > My answers to that are already scattered in the emails in various places,
> > and in the various rationales and justifications provided in the NEP.
>
> I understand the desire not to get caught up in spending all your time
> writing emails explaining things that you feel like you've already
> explained.
>
> Maybe there's an email I missed somewhere where you explain the
> conceptual model behind your NEP's semantics in a short,
> easy-to-understand way (comparable to, say, the Rationale section of
> the alterNEP). But I haven't seen it and I can't reconstruct a
> rationale for it myself (the alterNEP comes out of my attempts to do
> so!).
>
> >> What do you see as the important points of difference between the NEP
> >> and the alterNEP?
> >
> > The biggest thing is the NEP supports more use cases in a clean way by
> > composition of different simpler components. It defines one clear missing
> > data abstraction, and proposes two implementations that are interchangeable
> > and can interoperate.
>
> But the two implementations in your proposal are not interchangeable!
> The whole justification for starting with a masked-based
> implementation in your proposal is that it supports unmasking via
> views; if that requirement were removed, then there would be no reason
> to bother with the masking-based implementation at all.
>
> Well, that's not true. There are some marginal advantages in the
> special case of working with integers+NAs. But I don't think anyone's
> making that argument.
>
> > The alterNEP proposes two independent APIs, reducing
> > interoperability and so significantly increasing the amount of learning
> > required to work with both of them. This also precludes switching between
> > the two approaches without a lot of work.
>
> You can't switch between Python and C without a lot of work too, but
> that doesn't mean that they should be merged into one design... but
> they do complement each other beautifully. Just like missing data and
> masked arrays :-).
>
> > The current pull request that's sitting there waiting for review does not
> > have an impact on which approach goes ahead, but the code I'm doing now
> > does. This is a fairly large project, and I don't have a great length of
> > time to do it in, so I'm not going to participate extensively in the
> > alterNEP discussion. If you want to help me, please review my code and
> > provide specific feedback on my NEP (the code review system in github is
> > great for this too, I've received some excellent feedback on the NEP that
> > way). If you want to change my mind about things, please address the
> > specific design decisions you think are problematic by specifically
> > responding to lines in the NEP, as part of code-reviewing my pull request in
> > github.
>
> I know I'm being grumpy in this email, and I apologize for that. But,
> no. I've given extensive feedback, read the list carefully, and
> thought hard about these issues, and so far you've basically just
> dismissed my concerns. (See, e.g., [1], where your response to "we
> have to choose whether it's possible to recover data after it has been
> masked/NAed/whatever" is "no we don't, it should be both possible and
> impossible", which, I mean, what?) I've done my best to express them
> clearly, in the best way I know how -- and that way is *not* line by
> line comments on your NEP, because my concerns are more fundamental
> than that.
>
> I am of course happy to answer questions and such if there are places
> where I've been unclear.
>
> And of course it's your prerogative to decide how you want to spend
> your time (well, yours and your employer's, I guess), which forums you
> want to participate in, what code you want to write, etc. If you have
> decided that you are tired of talking about this and want to just go
> off and implement something, then good luck (and I do mean that, it
> isn't sarcasm).
>
> But as far as I can tell right now, every single person who has
> experience with handling missing data for statistical purposes (esp.
> in R) has real concerns about your proposal, and AFAICT the community
> has very much *not* reached consensus on how these features should
> look. So I guess my question is, once you've spent your limited time
> on writing this code -- how confident are you that it will be merged?
> This isn't a threat or anything, I have no power over what gets
> merged, but -- it seems to me that there's a real chance that you'll
> do this work and then it will go down in flames, or that it will be
> merged and then the people you're trying to target will ignore it
> anyway. This is why we try to build consensus first, right? I would
> love to find some way to make everyone happy (and have been doing what
> I can on that front), but right now I am not happy, other people are
> not happy, and you're communicating that you don't think that matters.
> I'd love for that to change.
>

I'm a statistics grad student and an R user, and I'm mostly ok with what
Mark is doing.

Currently, as I understand it, Mark is working on a structure that will make
missing data into a first-class citizen in the numpy world. This is great!
Until now it has been more of a second-class citizen. And Mark is even trying
to copy R semantics as much as possible.

It's true that Mark's making it so the masked part of these new arrays won't
be as front and center. The functionality will be there and it will be easy
to use. But it will be based more on an explicit contract that the data
memory contents of a masked array will not be overwritten when the data is
masked. So I don't think Mark is making anything implicit--he's making a
very explicit contract about how the data memory is handled when the mask is
changed.

If I understand correctly, the main objection to Mark's current API is that
the explicit contract about data memory isn't immediately visible in the
API. That is a trade-off, but it leads to a simpler API in which all the
features are easier to use together. The cost is pretty small: the user just
has to read enough to realize that there's an explicit contract about what
happens to the memory of a masked value, and that they can access it by
taking a view. That's easy enough to add at the very beginning of the
documentation.
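
As a rough illustration of that kind of contract, here is a sketch using
today's numpy.ma rather than Mark's proposed API (which may well look
different). Masking an element only flips the mask; the underlying memory
keeps its value, and it can be read back through the data view or recovered
by clearing the mask:

```python
import numpy as np
import numpy.ma as ma

data = np.array([1.0, 2.0, 3.0, 4.0])
# Wrap the existing buffer; numpy.ma does not copy it by default.
arr = ma.masked_array(data, mask=[False, False, False, False])

# Mask the last element. Only the mask changes, not the data memory.
arr[3] = ma.masked
sum_while_masked = arr.sum()

# The hidden value is still in memory and visible through the data view.
hidden_value = arr.data[3]

# Clearing the mask brings the original value back into play.
arr.mask[3] = False
sum_after_unmask = arr.sum()
```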

-Chris JS



>
> -- Nathaniel
>
> [1] http://mail.scipy.org/pipermail/numpy-discussion/2011-June/057274.html