Re: [Numpy-discussion] alterNEP - was: missing data discussion round 2
On Jul 1, 2011 7:14 PM, "Mark Wiebe" <mwwiebe@gmail.com> wrote:
On Fri, Jul 1, 2011 at 10:15 AM, Nathaniel Smith <njs@pobox.com> wrote:
On Fri, Jul 1, 2011 at 7:09 AM, Mark Wiebe <mwwiebe@gmail.com> wrote:
On Fri, Jul 1, 2011 at 6:58 AM, Matthew Brett <matthew.brett@gmail.com> wrote:
Do you see problems with the alterNEP proposal?
Yes, I really like my design as it stands now, and the alterNEP removes a lot of the abstraction and interoperability that are in my opinion the best parts. I've made more updates to the NEP based on continuing feedback, which are part of the pull request I want reviews for.
If so, what are they?
Mainly: Reduced interoperability, more complex implementation (leading to more bugs), and an unclear theoretical model for the masked part of it.
Can you give any examples of situations where one would run into this "reduced interoperability"? I'm not sure what it means. The only person who has so far spoken up as needing both masking semantics and NA semantics -- Gary Strangman -- has said that he strongly prefers the alterNEP semantics *exactly because* it makes it clear *how these functions will interoperate.*
I've given examples before, but here are a few:
1) You're using NA dtypes. You realize you want multiple views of the same data with different choices of NA. You switch to masked arrays with a few lines of code changes.
Multiple NAs? AFAIU, there's only one NA (per type) but several choices to allocate an IGNORE depending on the situation.
2) You're using masks. You realize that you will save memory/disk space if you switch to NA dtypes, and it's possible because it turned out that while you thought you would need masking, you came up with a new algorithm that didn't require it.
3) You're writing matplotlib, and you want to support all forms of NA-style data. You write it once instead of twice. Repeat for all other open source libraries that want to do this.
Ok, your IGNOREs become NAs because you want to... You switch your NAs to IGNOREs, ok, your call again.
Can you give any examples of how the implementation would be more complicated? As far as I can tell there are no elements in the alterNEP that are not in your NEP, they mostly just expose the functionality differently at the top level.
If that is the case, then it should be easy to change to your model after the implementation is complete. I'm happy with that; design choices of this kind are easier to make when you're comparing actual usage rather than hypotheticals.
Do you have a clearer theoretical model for the masked part of your proposal?
Yes, exactly the same model used for NA dtypes.
The best I've been able to extract from any of your messages is when you wrote "it seems to me that people wanting masked arrays want missing data without touching their data". But as a matter of English grammar, I have no idea what this means -- if you have data, it's not missing!
Ok, missing data-like functionality, which is provided by the solid theory
behind the missing data. Which is a subset of 'masked/to ignore' data...
It seems to me that people wanting masked data want to *hide* parts of their data, which seems much clearer to me and is the theoretical model used in the alterNEP.
Once you've hidden it, isn't it now missing?
Note that this model actually predicts several of the differences between how people want masks to work and how people want NAs to work (e.g., their behavior during reduction); I
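To make the reduction difference concrete, here is a minimal pure-Python sketch. It is not NumPy code and not the API of either proposal: the `NA` and `IGNORE` sentinels and the two `sum_*` helpers are hypothetical names invented for illustration. The assumption is that an NA makes the result of a sum unknown, while an IGNOREd value is simply left out.

```python
# Hypothetical sentinels for illustration; neither is a real NumPy object.
class _Sentinel:
    def __init__(self, name):
        self.name = name

    def __repr__(self):
        return self.name


NA = _Sentinel("NA")          # the value is unknown
IGNORE = _Sentinel("IGNORE")  # the value exists but is hidden


def sum_propagating(values):
    """NA-style reduction: one unknown input makes the result unknown."""
    total = 0
    for v in values:
        if v is NA:
            return NA
        total += v
    return total


def sum_skipping(values):
    """IGNORE-style reduction: hidden inputs are left out of the sum."""
    return sum(v for v in values if v is not IGNORE)


print(sum_propagating([1, NA, 3]))   # NA
print(sum_skipping([1, IGNORE, 3]))  # 4
```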
Do you agree that the alterNEP proposal is easier to understand?
No.
If not, can you explain why?
My answers to that are already scattered in the emails in various places, and in the various rationales and justifications provided in the NEP.
I understand the desire not to get caught up in spending all your time writing emails explaining things that you feel like you've already explained.
Maybe there's an email I missed somewhere where you explain the conceptual model behind your NEP's semantics in a short, easy-to-understand way (comparable to, say, the Rationale section of the alterNEP). But I haven't seen it and I can't reconstruct a rationale for it myself (the alterNEP comes out of my attempts to do so!).
I've been repeatedly updating the NEP. In particular this "round 2" email was an attempt to clarify between the two missing data models (what's being called NA and IGNORE), and the two implementation techniques (NA bit patterns and masks). I've argued that these are completely independent from each other.
Only temporarily, you can revert to not hidden when needed. If data is flagged as NA, it should never be accessible again.
What do you see as the important points of difference between the NEP and the alterNEP?
The biggest thing is the NEP supports more use cases in a clean way by composition of different simpler components. It defines one clear missing data abstraction, and proposes two implementations that are interchangeable and can interoperate.
But the two implementations in your proposal are not interchangeable! The whole justification for starting with a masked-based implementation in your proposal is that it supports unmasking via views; if that requirement were removed, then there would be no reason to bother with the masking-based implementation at all.
They are interchangeable 100% with regard to the missing data semantics. Views are an orthogonal feature, and it is through composition of these two features that the masks gain this power.
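As a rough illustration of the "masks plus views" composition under discussion, here is a pure-Python sketch. `MaskedView` is a hypothetical class, not part of NumPy or of either proposal. It assumes the mask is stored beside the data instead of inside it, so several views can share one buffer and unmasking recovers the original value.

```python
class MaskedView:
    """Hypothetical view: shares `data` with other views, owns its own mask."""

    def __init__(self, data, mask=None):
        self.data = data  # shared list; masking never modifies it
        self.mask = list(mask) if mask is not None else [False] * len(data)

    def visible(self):
        """Return the values this view does not hide."""
        return [v for v, hidden in zip(self.data, self.mask) if not hidden]


data = [10, 20, 30]
a = MaskedView(data, [False, True, False])  # this view hides the 20
b = MaskedView(data)                        # this view hides nothing

a.mask[1] = False  # unmasking: the original 20 is still in `data`
```

With a bit-pattern representation, by contrast, marking an element as NA would overwrite the stored value, so there would be nothing left to unmask.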
Well, that's not true. There are some marginal advantages in the special case of working with integers+NAs. But I don't think anyone's making that argument.
The alterNEP proposes two independent APIs, reducing interoperability and so significantly increasing the amount of learning required to work with both of them. This also precludes switching between the two approaches without a lot of work.
You can't switch between Python and C without a lot of work too, but that doesn't mean that they should be merged into one design... but they do complement each other beautifully. Just like missing data and masked arrays :-).
This last statement is why I feel like you haven't been reading my emails. I've clearly positioned masks as an implementation technique, not implying any specific semantics.
The current pull request that's sitting there waiting for review does not have an impact on which approach goes ahead, but the code I'm doing now does. This is a fairly large project, and I don't have a great length of time to do it in, so I'm not going to participate extensively in the alterNEP discussion. If you want to help me, please review my code and provide specific feedback on my NEP (the code review system in github is great for this too, I've received some excellent feedback on the NEP that way). If you want to change my mind about things, please address the specific design decisions you think are problematic by specifically responding to lines in the NEP, as part of code-reviewing my pull request in github.
I know I'm being grumpy in this email, and I apologize for that. But, no. I've given extensive feedback, read the list carefully, and thought hard about these issues, and so far you've basically just dismissed my concerns. (See, e.g., [1], where your response to "we have to choose whether it's possible to recover data after it has been masked/NAed/whatever" is "no we don't, it should be both possible and impossible", which, I mean, what?) I've done my best to express them clearly, in the best way I know how -- and that way is *not* line by line comments on your NEP, because my concerns are more fundamental than that.
I've likewise read your emails carefully, and really appreciated that you jumped in right at the beginning with a good explanation of R's missing value semantics. I think line by line comments on the NEP expressing where [...] that the fundamental problems would help us communicate better. I've tried to tease apart the distinction between the missing value abstractions and the implementation techniques, and I haven't seen the fact that you read that reflected in your emails. If you have a good reason why implementing something with masks implies certain semantics, please explain, dealing with the points that I've laid out arguing for this design choice in the latest NEP, accessible via the pull request.
I'll check your code, but conceptually, NAs and IGNOREs are NOT interchangeable.
I am of course happy to answer questions and such if there are places where I've been unclear.
And of course it's your prerogative to decide how you want to spend your time (well, yours and your employer's, I guess), which forums you want to participate in, what code you want to write, etc. If you have decided that you are tired of talking about this and want to just go off and implement something, then good luck (and I do mean that, it isn't sarcasm).
I do want to constructively engage the community at the same time as I do
the implementation, and I have a track record of producing good interfaces even when the underlying functionality is complex. I've had very positive feedback about einsum from people who deal with multiple arrays of multidimensional data and were missing an easy way to do that kind of operation.
But as far as I can tell right now, every single person who has experience with handling missing data for statistical purposes (esp. in R) has real concerns about your proposal, and AFAICT the community has very much *not* reached consensus on how these features should look. So I guess my question is, once you've spent your limited time on writing this code -- how confident are you that it will be merged? This isn't a threat or anything, I have no power over what gets merged, but -- it seems to me that there's a real chance that you'll do this work and then it will go down in flames, or that it will be merged and then the people you're trying to target will ignore it anyway. This is why we try to build consensus first, right? I would love to find some way to make everyone happy (and have been doing what I can on that front), but right now I am not happy, other people are not happy, and you're communicating that you don't think that matters. I'd love for that to change.
Building consensus is in general virtually impossible; I'm for example very impressed with the C++ standards committee's success in achieving it where they have. My development process is different from what you're describing. Like with datetime, I am merging periodically, not doing one big merge at the end. There's a reason why design by committee is frowned upon. The feedback is great, but still needs to go through a very strict software design quality filter.
-Mark
-- Nathaniel
[1] http://mail.scipy.org/pipermail/numpy-discussion/2011-June/057274.html
_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
participants (1)
-
Pierre GM