Mailman 3 Re: [Numpy-discussion] alterNEP - was: missing data discussion round 2 - NumPy-Discussion

1 Jul 2011

On Jul 1, 2011 7:14 PM, "Mark Wiebe" <mwwiebe@gmail.com> wrote:
...
On Fri, Jul 1, 2011 at 10:15 AM, Nathaniel Smith <njs@pobox.com> wrote:
...
On Fri, Jul 1, 2011 at 7:09 AM, Mark Wiebe <mwwiebe@gmail.com> wrote:
...
On Fri, Jul 1, 2011 at 6:58 AM, Matthew Brett <matthew.brett@gmail.com> wrote:
...
Do you see problems with the alterNEP proposal?
Yes, I really like my design as it stands now, and the alterNEP removes
...
...
...
a lot of the abstraction and interoperability that are in my opinion the best
parts. I've made more updates to the NEP based on continuing feedback, which
are part of the pull request I want reviews for.
...
If so, what are they?
Mainly: Reduced interoperability, more complex implementation (leading to
more bugs), and an unclear theoretical model for the masked part of it.
Can you give any examples of situations where one would run into this
"reduced interoperability"? I'm not sure what it means. The only
person who has so far spoken up as needing both masking semantics and
NA semantics -- Gary Strangman -- has said that he strongly prefers
the alterNEP semantics *exactly because* it makes it clear *how these
functions will interoperate.*
I've given examples before, but here are a few:
1) You're using NA dtypes. You realize you want multiple views of the same
data with different choices of NA. You switch to masked arrays with a few
lines of code changes.

Multiple NAs? AFAIU, there's only one NA (per type) but several choices to
allocate an IGNORE depending on the situation.
...
2) You're using masks. You realize that you will save memory/disk space if
you switch to NA dtypes, and it's possible because it turned out that while
you thought you would need masking, you came up with a new algorithm that
didn't require it.
...
3) You're writing matplotlib, and you want to support all forms of
NA-style data. You write it once instead of twice. Repeat for all other open
source libraries that want to do this.

Ok, your IGNOREs become NAs because you want to...

You switch your NAs to IGNOREs, ok, your call again.
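
To make example 1 concrete, here is a minimal sketch of "multiple views of the
same data with different choices of what is hidden". It uses the existing
numpy.ma module as a stand-in, not the masked-array API proposed in either the
NEP or the alterNEP; the variable names are illustrative only.

```python
import numpy as np
import numpy.ma as ma

# One underlying data buffer.
data = np.array([1.0, 2.0, 3.0, 4.0])

# Two masked arrays over the same buffer, each hiding a different element.
# copy=False asks numpy.ma not to copy the data.
view_a = ma.array(data, mask=[False, True, False, False], copy=False)
view_b = ma.array(data, mask=[False, False, False, True], copy=False)

print(view_a.sum())  # 1 + 3 + 4 = 8.0
print(view_b.sum())  # 1 + 2 + 3 = 6.0

# Both views share memory with the original array; only the masks differ.
print(np.shares_memory(view_a.data, data))
print(np.shares_memory(view_b.data, data))
```

With a bit-pattern NA stored in the data itself, the two "choices" above could
not coexist over one buffer, which is the situation example 1 describes.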
...
...
Can you give any examples of how the implementation would be more
complicated? As far as I can tell there are no elements in the
alterNEP that are not in your NEP, they mostly just expose the
functionality differently at the top level.
If that is the case, then it should be easy to change to your model after
the implementation is complete. I'm happy with that; this style of design
choice is easier to make when you're comparing actual usage rather than
hypotheticals.
...
...
Do you have a clearer theoretical model for the masked part of your
proposal?
Yes, exactly the same model used for NA dtypes.
...
The best I've been able to extract from any of your messages
is when you wrote "it seems to me that people wanting masked arrays
want missing data without touching their data". But as a matter of
English grammar, I have no idea what this means -- if you have data,
it's not missing!
Ok, missing data-like functionality, which is provided by the solid theory
behind the missing data.

Which is a subset of 'masked/to ignore' data...
...
...
It seems to me that people wanting masked data want
to *hide* parts of their data, which seems much clearer to me and is
the theoretical model used in the alterNEP.
Once you've hidden it, isn't it now missing?
...
...
Note that this model
actually predicts several of the differences between how people want
masks to work and how people want NAs to work (e.g., their behavior
during reduction); I
...
...
...
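
As a rough illustration of the reduction difference mentioned above: the
sketch below uses NaN as a stand-in for an NA value (it is not the NA proposed
in the NEP) and numpy.ma for the hidden/IGNORE case. Under those assumptions,
the NA-like value propagates through a sum while the hidden value is skipped.

```python
import numpy as np
import numpy.ma as ma

# NA-like semantics, approximated with NaN: the unknown value
# propagates, so the sum is itself unknown.
with_na = np.array([1.0, np.nan, 3.0])
na_total = with_na.sum()

# IGNORE-like semantics via numpy.ma: the hidden element is skipped,
# so the sum covers only the visible elements.
with_hidden = ma.array([1.0, 2.0, 3.0], mask=[False, True, False])
hidden_total = with_hidden.sum()

print(np.isnan(na_total))  # True
print(hidden_total)        # 1 + 3 = 4.0
```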
Do you agree that the alterNEP proposal is easier to understand?
No.
...
If not, can you explain why?
My answers to that are already scattered in the emails in various places,
...
...
...
and in the various rationales and justifications provided in the NEP.
I understand the desire not to get caught up in spending all your time
writing emails explaining things that you feel like you've already
explained.
Maybe there's an email I missed somewhere where you explain the
conceptual model behind your NEP's semantics in a short,
easy-to-understand way (comparable to, say, the Rationale section of
the alterNEP). But I haven't seen it and I can't reconstruct a
rationale for it myself (the alterNEP comes out of my attempts to do
so!).
I've been repeatedly updating the NEP. In particular this "round 2" email
was an attempt to clarify between the two missing data models (what's being
called NA and IGNORE), and the two implementation techniques (NA bit
patterns and masks). I've argued that these are completely independent from
each other.

Only temporarily, you can revert to not hidden when needed.
If data is flagged as NA, it should never be accessible again.
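
The point about hidden data being recoverable while NA data is not can be
sketched as follows. This again assumes numpy.ma for the hidden case and NaN
as a stand-in for a bit-pattern NA; neither is the API under discussion.

```python
import numpy as np
import numpy.ma as ma

# Hidden (IGNORE-like): the value is still stored underneath the mask.
hidden = ma.array([1.0, 2.0, 3.0], mask=[False, True, False])
print(hidden.sum())      # 4.0 while the middle element is hidden

hidden.mask = ma.nomask  # unmask everything
print(hidden.sum())      # 6.0, the original 2.0 is visible again

# NA-like, stored in the data itself: assigning it overwrites the value,
# so there is nothing left to recover.
overwritten = np.array([1.0, 2.0, 3.0])
overwritten[1] = np.nan
print(overwritten)
```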
...
...
...
...
What do you see as the important points of difference between the NEP
and the alterNEP?
The biggest thing is the NEP supports more use cases in a clean way by
composition of different simpler components. It defines one clear missing
...
...
...
data abstraction, and proposes two implementations that are interchangeable
and can interoperate.
But the two implementations in your proposal are not interchangeable!
The whole justification for starting with a masked-based
implementation in your proposal is that it supports unmasking via
views; if that requirement were removed, then there would be no reason
to bother with the masking-based implementation at all.
They are interchangeable 100% with regard to the missing data semantics.
Views are an orthogonal feature, and it is through composition of these two
features that the masks gain this power.
...
...
Well, that's not true. There are some marginal advantages in the
special case of working with integers+NAs. But I don't think anyone's
making that argument.
...
The alterNEP proposes two independent APIs, reducing
interoperability and so significantly increasing the amount of learning
required to work with both of them. This also precludes switching between
...
...
the two approaches without a lot of work.
You can't switch between Python and C without a lot of work too, but
that doesn't mean that they should be merged into one design... but
they do complement each other beautifully. Just like missing data and
masked arrays :-).
This last statement is why I feel like you haven't been reading my emails.
I've clearly positioned masks as an implementation technique, not implying
any specific semantics.
...
...
The current pull request that's sitting there waiting for review does not
...
...
have an impact on which approach goes ahead, but the code I'm doing now
does. This is a fairly large project, and I don't have a great length of
time to do it in, so I'm not going to participate extensively in the
alterNEP discussion. If you want to help me, please review my code and
provide specific feedback on my NEP (the code review system in github is
great for this too, I've received some excellent feedback on the NEP
...
...
...
way). If you want to change my mind about things, please address the
specific design decisions you think are problematic by specifically
responding to lines in the NEP, as part of code-reviewing my pull
request in github.
I know I'm being grumpy in this email, and I apologize for that. But,
no. I've given extensive feedback, read the list carefully, and
thought hard about these issues, and so far you've basically just
dismissed my concerns. (See, e.g., [1], where your response to "we
have to choose whether it's possible to recover data after it has been
masked/NAed/whatever" is "no we don't, it should be both possible and
impossible", which, I mean, what?) I've done my best to express them
clearly, in the best way I know how -- and that way is *not* line by
line comments on your NEP, because my concerns are more fundamental
than that.
I've likewise read your emails carefully, and really appreciated that you
jumped in right at the beginning with a good explanation of R's missing
value semantics. I think line by line comments on the NEP expressing where
that
the fundamental problems would help us communicate better. I've tried to
tease apart the distinction between the missing value abstractions and the
implementation techniques, and I haven't seen the fact that you read that
reflected in your emails. If you have a good reason why implementing
something with masks implies certain semantics, please explain, dealing with
the points that I've laid out arguing for this design choice in the latest
NEP, accessible via the pull request.

I'll check your code, but conceptually, NAs and IGNOREs are NOT
interchangeable.
...
...
I am of course happy to answer questions and such if there are places
where I've been unclear.
And of course it's your prerogative to decide how you want to spend
your time (well, yours and your employer's, I guess), which forums you
want to participate in, what code you want to write, etc. If you have
decided that you are tired of talking about this and want to just go
off and implement something, then good luck (and I do mean that, it
isn't sarcasm).
I do want to constructively engage the community at the same time as I do
the implementation, and I have a track record of producing good interfaces
even when the underlying functionality is complex. I've had very positive
feedback about einsum from people who deal with multiple arrays of
multidimensional data and were missing an easy way to do that kind of
operation.
...
...
But as far as I can tell right now, every single person who has
experience with handling missing data for statistical purposes (esp.
in R) has real concerns about your proposal, and AFAICT the community
has very much *not* reached consensus on how these features should
look. So I guess my question is, once you've spent your limited time
on writing this code -- how confident are you that it will be merged?
This isn't a threat or anything, I have no power over what gets
merged, but -- it seems to me that there's a real chance that you'll
do this work and then it will go down in flames, or that it will be
merged and then the people you're trying to target will ignore it
anyway. This is why we try to build consensus first, right? I would
love to find some way to make everyone happy (and have been doing what
I can on that front), but right now I am not happy, other people are
not happy, and you're communicating that you don't think that matters.
I'd love for that to change.
Building consensus is in general virtually impossible; I'm for example very
impressed with the C++ standards committee's success in achieving it where
they have. My development process is different from what you're describing.
Like with datetime, I am merging periodically, not doing one big merge at
the end. There's a reason why design by committee is frowned upon. The
feedback is great, but still needs to go through a very strict software
design quality filter.
...
-Mark
...
-- Nathaniel
[1] http://mail.scipy.org/pipermail/numpy-discussion/2011-June/057274.html
...
...
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Pierre GM
