[Numpy-discussion] alterNEP - was: missing data discussion round 2

Pierre GM pgmdevlist at gmail.com
Fri Jul 1 14:17:00 EDT 2011


On Jul 1, 2011 7:14 PM, "Mark Wiebe" <mwwiebe at gmail.com> wrote:
>
> On Fri, Jul 1, 2011 at 10:15 AM, Nathaniel Smith <njs at pobox.com> wrote:
>>
>> On Fri, Jul 1, 2011 at 7:09 AM, Mark Wiebe <mwwiebe at gmail.com> wrote:
>> > On Fri, Jul 1, 2011 at 6:58 AM, Matthew Brett <matthew.brett at gmail.com>
>> > wrote:
>> >> Do you see problems with the alterNEP proposal?
>> >
>> > Yes, I really like my design as it stands now, and the alterNEP removes a
>> > lot of the abstraction and interoperability that are in my opinion the best
>> > parts. I've made more updates to the NEP based on continuing feedback, which
>> > are part of the pull request I want reviews for.
>> >
>> >>
>> >> If so, what are they?
>> >
>> > Mainly: Reduced interoperability, more complex implementation (leading to
>> > more bugs), and an unclear theoretical model for the masked part of it.
>>
>> Can you give any examples of situations where one would run into this
>> "reduced interoperability"? I'm not sure what it means. The only
>> person who has so far spoken up as needing both masking semantics and
>> NA semantics -- Gary Strangman -- has said that he strongly prefers
>> the alterNEP semantics *exactly because* it makes it clear *how these
>> functions will interoperate.*
>
>
> I've given examples before, but here are a few:
>
> 1) You're using NA dtypes. You realize you want multiple views of the same
> data with different choices of NA. You switch to masked arrays with a few
> lines of code changes.

Multiple NAs? AFAIU, there's only one NA (per type) but several choices to
allocate an IGNORE depending on the situation.
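
As a rough illustration of "several IGNOREs over one set of data", here is a
minimal sketch using the existing numpy.ma module. It is not the NEP or
alterNEP API; the names view_a and view_b are made up for the example, and it
relies on masked_array not copying an ndarray input by default.

```python
import numpy as np
import numpy.ma as ma

# One underlying buffer of values.
data = np.arange(6, dtype=float)  # [0, 1, 2, 3, 4, 5]

# Two masked arrays over the same buffer, each ignoring different elements.
view_a = ma.masked_array(data, mask=data < 2)  # ignores 0 and 1
view_b = ma.masked_array(data, mask=data > 3)  # ignores 4 and 5

# Both wrap the original memory rather than a copy (copy=False is the default).
shares = np.shares_memory(view_a.data, data) and np.shares_memory(view_b.data, data)

sum_a = view_a.sum()  # 2 + 3 + 4 + 5
sum_b = view_b.sum()  # 0 + 1 + 2 + 3

# A write to the shared buffer is seen through both masked arrays.
data[2] = 100.0
sum_a_after = view_a.sum()
sum_b_after = view_b.sum()
```

Each masked array carries its own mask, so the choice of what to ignore is
per view while the values themselves stay untouched.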

> 2) You're using masks. You realize that you will save memory/disk space if
> you switch to NA dtypes, and it's possible because it turned out that while
> you thought you would need masking, you came up with a new algorithm that
> didn't require it.

Ok, your IGNOREs become NAs because you want to...

> 3) You're writing matplotlib, and you want to support all forms of
> NA-style data. You write it once instead of twice. Repeat for all other open
> source libraries that want to do this.

You switch your NAs to IGNOREs, ok, your call again.

>
>>
>> Can you give any examples of how the implementation would be more
>> complicated? As far as I can tell there are no elements in the
>> alterNEP that are not in your NEP, they mostly just expose the
>> functionality differently at the top level.
>
>
> If that is the case, then it should be easy to change to your model after
> the implementation is complete. I'm happy with that; this style of design
> choice is easier to make when you're comparing actual usage than
> hypotheticals.
>
>> Do you have a clearer theoretical model for the masked part of your
>> proposal?
>
>
> Yes, exactly the same model used for NA dtypes.
>
>>
>> The best I've been able to extract from any of your messages
>> is when you wrote "it seems to me that people wanting masked arrays
>> want missing data without touching their data". But as a matter of
>> English grammar, I have no idea what this means -- if you have data,
>> it's not missing!
>
>
> Ok, missing data-like functionality, which is provided by the solid theory
> behind the missing data.

Which is a subset of 'masked/to ignore' data...

>>
>> It seems to me that people wanting masked data want
>> to *hide* parts of their data, which seems much clearer to me and is
>> the theoretical model used in the alterNEP.
>
>
> Once you've hidden it, isn't it now missing?

Only temporarily, you can revert to not hidden when needed.
If a datum is flagged as NA, it should never be accessible again.
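
A minimal sketch of that difference, using numpy.ma for the "hidden" case and
a NaN sentinel as a stand-in for NA. The NaN stand-in is only an assumption
for illustration: the NA dtypes discussed in this thread are a proposal, not
something the code below uses.

```python
import numpy as np
import numpy.ma as ma

data = np.array([1.0, 2.0, 3.0, 4.0])

# IGNORE-style: hide the second element behind a mask.
m = ma.masked_array(data, mask=[False, True, False, False])
hidden_sum = m.sum()    # the masked element is skipped
still_there = m.data[1] # the value is hidden, not destroyed

# Revert to "not hidden": clear the mask and the value is back.
m.mask = False
restored_sum = m.sum()

# NA-style (NaN as a stand-in): the value is overwritten by the sentinel,
# so nothing in `na` allows the original 2.0 to be recovered.
na = data.copy()
na[1] = np.nan
na_sum = np.nansum(na)  # skipping the sentinel is the most we can do
```

With the mask the original 2.0 is available again after unmasking; with the
sentinel it can only be skipped.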

>>
>> Note that this model
>> actually predicts several of the differences between how people want
>> masks to work and how people want NAs to work (e.g., their behavior
>> during reduction); I
>
>
>>
>> >> Do you agree that the alterNEP proposal is easier to understand?
>> >
>> > No.
>> >>
>> >> If not, can you explain why?
>> >
>> > My answers to that are already scattered in the emails in various places,
>> > and in the various rationales and justifications provided in the NEP.
>>
>> I understand the desire not to get caught up in spending all your time
>> writing emails explaining things that you feel like you've already
>> explained.
>>
>> Maybe there's an email I missed somewhere where you explain the
>> conceptual model behind your NEP's semantics in a short,
>> easy-to-understand way (comparable to, say, the Rationale section of
>> the alterNEP). But I haven't seen it and I can't reconstruct a
>> rationale for it myself (the alterNEP comes out of my attempts to do
>> so!).
>
>
> I've been repeatedly updating the NEP. In particular this "round 2" email
> was an attempt to clarify between the two missing data models (what's being
> called NA and IGNORE), and the two implementation techniques (NA bit
> patterns and masks). I've argued that these are completely independent from
> each other.
>
>>
>> >> What do you see as the important points of difference between the NEP
>> >> and the alterNEP?
>> >
>> > The biggest thing is the NEP supports more use cases in a clean way by
>> > composition of different simpler components. It defines one clear missing
>> > data abstraction, and proposes two implementations that are interchangeable
>> > and can interoperate.
>>
>> But the two implementations in your proposal are not interchangeable!
>> The whole justification for starting with a masked-based
>> implementation in your proposal is that it supports unmasking via
>> views; if that requirement were removed, then there would be no reason
>> to bother with the masking-based implementation at all.
>
>
> They are interchangeable 100% with regard to the missing data semantics.
> Views are an orthogonal feature, and it is through composition of these two
> features that the masks gain this power.

I'll check your code, but conceptually, NAs and IGNOREs are NOT
interchangeable.

>>
>> Well, that's not true. There are some marginal advantages in the
>> special case of working with integers+NAs. But I don't think anyone's
>> making that argument.
>>
>> > The alterNEP proposes two independent APIs, reducing
>> > interoperability and so significantly increasing the amount of learning
>> > required to work with both of them. This also precludes switching between
>> > the two approaches without a lot of work.
>>
>> You can't switch between Python and C without a lot of work too, but
>> that doesn't mean that they should be merged into one design... but
>> they do complement each other beautifully. Just like missing data and
>> masked arrays :-).
>
>
> This last statement is why I feel like you haven't been reading my emails.
> I've clearly positioned masks as an implementation technique, not implying
> any specific semantics.
>
>>
>>
>> > The current pull request that's sitting there waiting for review does not
>> > have an impact on which approach goes ahead, but the code I'm doing now
>> > does. This is a fairly large project, and I don't have a great length of
>> > time to do it in, so I'm not going to participate extensively in the
>> > alterNEP discussion. If you want to help me, please review my code and
>> > provide specific feedback on my NEP (the code review system in github is
>> > great for this too, I've received some excellent feedback on the NEP that
>> > way). If you want to change my mind about things, please address the
>> > specific design decisions you think are problematic by specifically
>> > responding to lines in the NEP, as part of code-reviewing my pull request in
>> > github.
>>
>> I know I'm being grumpy in this email, and I apologize for that. But,
>> no. I've given extensive feedback, read the list carefully, and
>> thought hard about these issues, and so far you've basically just
>> dismissed my concerns. (See, e.g., [1], where your response to "we
>> have to choose whether it's possible to recover data after it has been
>> masked/NAed/whatever" is "no we don't, it should be both possible and
>> impossible", which, I mean, what?) I've done my best to express them
>> clearly, in the best way I know how -- and that way is *not* line by
>> line comments on your NEP, because my concerns are more fundamental
>> than that.
>
>
> I've likewise read your emails carefully, and really appreciated that you
> jumped in right at the beginning with a good explanation of R's missing
> value semantics. I think line by line comments on the NEP expressing where
> the fundamental problems are would help us communicate better. I've tried to
> tease apart the distinction between the missing value abstractions and the
> implementation techniques, and I haven't seen the fact that you read that
> reflected in your emails. If you have a good reason why implementing
> something with masks implies certain semantics, please explain, dealing with
> the points that I've laid out arguing for this design choice in the latest
> NEP, accessible via the pull request.
>
>> I am of course happy to answer questions and such if there are places
>> where I've been unclear.
>>
>> And of course it's your prerogative to decide how you want to spend
>> your time (well, yours and your employer's, I guess), which forums you
>> want to participate in, what code you want to write, etc. If you have
>> decided that you are tired of talking about this and want to just go
>> off and implement something, then good luck (and I do mean that, it
>> isn't sarcasm).
>
>
> I do want to constructively engage the community at the same time as I do
> the implementation, and I have a track record of producing good interfaces
> even when the underlying functionality is complex. I've had very positive
> feedback about einsum from people who deal with multiple arrays of
> multidimensional data and were missing an easy way to do that kind of
> operation.
>
>> But as far as I can tell right now, every single person who has
>> experience with handling missing data for statistical purposes (esp.
>> in R) has real concerns about your proposal, and AFAICT the community
>> has very much *not* reached consensus on how these features should
>> look. So I guess my question is, once you've spent your limited time
>> on writing this code -- how confident are you that it will be merged?
>> This isn't a threat or anything, I have no power over what gets
>> merged, but -- it seems to me that there's a real chance that you'll
>> do this work and then it will go down in flames, or that it will be
>> merged and then the people you're trying to target will ignore it
>> anyway. This is why we try to build consensus first, right? I would
>> love to find some way to make everyone happy (and have been doing what
>> I can on that front), but right now I am not happy, other people are
>> not happy, and you're communicating that you don't think that matters.
>> I'd love for that to change.
>
>
> Building consensus is in general virtually impossible; I'm for example very
> impressed with the C++ standards committee's success in achieving it where
> they have. My development process is different from what you're describing.
> Like with datetime, I am merging periodically, not doing one big merge at
> the end. There's a reason why design by committee is frowned upon. The
> feedback is great, but still needs to go through a very strict software
> design quality filter.
>
> -Mark
>
>>
>>
>> -- Nathaniel
>>
>> [1] http://mail.scipy.org/pipermail/numpy-discussion/2011-June/057274.html
>> _______________________________________________
>> NumPy-Discussion mailing list
>> NumPy-Discussion at scipy.org
>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>
>
>