Re: [Numpy-discussion] in the NA discussion, what can we agree on?
On 2011-11-03 04:22, numpy-discussion-request@scipy.org wrote:
Date: Wed, 2 Nov 2011 22:20:15 -0500
From: Benjamin Root
Subject: Re: [Numpy-discussion] in the NA discussion, what can we agree on?
To: Discussion of Numerical Python

On Wednesday, November 2, 2011, Nathaniel Smith
wrote: Hi Benjamin,
On Wed, Nov 2, 2011 at 5:25 PM, Benjamin Root
wrote: I want to pare this down even more. I think the above lists make too many unneeded extrapolations.
Okay. I found your formatting a little confusing, so I want to make sure I understood the changes you're suggesting:
For the description of what MISSING means, you removed the lines:
- Compatibility with R is valuable
- To avoid user confusion, ideally it should *not* be possible to 'unmask' a missing value, since this is inconsistent with the "missing value" metaphor (e.g., see Wes's comment about "leaky abstractions")
And you added the line:
+ Assigning MISSING is destructive
And for the description of what IGNORED means, you removed the lines:
- Some memory overhead is inevitable and acceptable
- Compatibility with R neither possible nor valuable
- Ability to toggle the IGNORED state of a location is critical, and should be as convenient as possible
And you added the lines:
+ IGNORE is non-destructive
+ Must be competitive with np.ma for speed and memory (or else users would just use np.ma)
Is that right? Correct.
Assuming it is, my thoughts are:
By R compatibility, I specifically had in mind in-memory compatibility. rpy2 provides a more-or-less seamless within-process interface between R and Python (and specifically lets you get numpy views on arrays returned by R functions), so if we can make this work for R arrays containing NA too then that'd be handy. (The rpy2 author requested this in the last discussion here: http://mail.scipy.org/pipermail/numpy-discussion/2011-June/057084.html) When it comes to disk formats, then this doesn't matter so much, since IO routines have to translate between different representations all the time anyway.
Interesting, but I still have to wonder if that should be on the wishlist for MISSING. I guess it depends on whether people would be fully converting from R or gradually transitioning from it? That is something that I can't answer.
I probably do not have all possible use cases in mind, but the one I think of as the most common is using R code straight from Python. Say that you are doing your work in Python and read about some statistical method for which an implementation exists in R (but not in Python/numpy). You can just pass your numpy arrays or vectors to the relevant R function(s) and retrieve the results in a form directly usable by numpy (without having the data copied around). Should performance become an issue, and that method be of crucial importance, you will probably want to reimplement it (in C or Cython, for example). Otherwise you can draw on R's phenomenal toolbox without much effort and keep those calls to R as part of your code. In my experience, the latter is the most frequent case. Add compatibility for the NA "magic" values and the coupling between R and numpy becomes even better, because neither side would mistake them for non-NA values.
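To make the NA "magic" value concrete, here is a minimal sketch using only the Python standard library; it is not rpy2 or numpy code. It assumes (this is a convention of R as I understand it, not something stated in this thread) that R encodes a double NA as a NaN whose low 32 bits hold 1954. The names R_NA_BITS, make_r_na and is_r_na are hypothetical.

```python
import math
import struct

# Assumption: R's NA for doubles is a NaN whose low 32-bit word is 1954.
R_NA_LOW_WORD = 1954
R_NA_BITS = 0x7FF0000000000000 | R_NA_LOW_WORD


def make_r_na():
    """Build a float carrying the assumed R NA bit pattern (hypothetical helper)."""
    return struct.unpack("<d", struct.pack("<Q", R_NA_BITS))[0]


def is_r_na(x):
    """True if x is a NaN whose low 32 bits match the assumed R NA payload."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    return math.isnan(x) and (bits & 0xFFFFFFFF) == R_NA_LOW_WORD


na = make_r_na()
plain_nan = float("nan")

# A plain NaN test cannot tell the two apart; only the payload check can.
both_look_like_nan = math.isnan(na) and math.isnan(plain_nan)
```

Under that assumption, code that only tests for NaN would treat an R NA as an ordinary NaN, which is the kind of misreading that shared NA support would avoid.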
I take the replacement of my line about MISSING disallowing unmasking and your line about MISSING assignment being destructive as basically expressing the same idea. Is that fair, or did you mean something else?
I am someone who wants to get to the absolute core of ideas. Also, this expression cleanly delineates the differences as binary.
By expressing it this way, we also shy away from implementation details. For example, unmasking can be programmatically prevented for MISSING, while it could be implemented by other, indirect means for IGNORE. Not that those are the preferred ways, only that the phrasing is more flexible and exacting.
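The destructive versus non-destructive distinction can be sketched with existing numpy features. This is only an illustration under stated assumptions: NaN stands in for MISSING and an np.ma mask stands in for IGNORED; neither is the design being discussed in this thread.

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0])

# MISSING-like (stand-in: NaN): the assignment overwrites the stored value,
# so the original 2.0 cannot be recovered from this array.
missing = values.copy()
missing[1] = np.nan

# IGNORED-like (stand-in: np.ma mask): the value is hidden, not overwritten.
ignored = np.ma.masked_array(values.copy(), mask=[False, True, False])
sum_while_ignored = ignored.sum()   # the masked element is skipped

# Clearing the mask brings the original value back into view.
ignored.mask = np.ma.nomask
sum_after_unmask = ignored.sum()
```

Here the masked array keeps 2.0 underneath the mask, while the NaN assignment has replaced it for good.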
Finally, do you think that people who want IGNORED support care about having a convenient API for masking/unmasking values? You removed that line, but I don't know if that was because you disagreed with it, or were just trying to simplify.
See previous.
Then, as a third-party module developer, I can tell you that having separate and independent ways to detect "MISSING"/"IGNORED" would likely make support more difficult, and that we would greatly benefit from a common (or easily combinable) method of identification.
Right, sorry... I didn't forget, and that's part of what I was thinking when I described the second approach as keeping them as *mostly*-separate interfaces... but I should have made it more explicit! Anyway, yes:
4) There is consensus that whatever approach is taken, there should be a quick and convenient way to identify values that are MISSING, IGNORED, or both. (E.g., functions is_MISSING, is_IGNORED, is_MISSING_or_IGNORED, or some equivalent.)
Good.
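As a rough sketch of what the identification functions in point 4 could look like, the following uses the function names from that point but hypothetical implementations. It assumes NaN as a stand-in for MISSING and an np.ma mask as a stand-in for IGNORED; it is not an existing numpy API.

```python
import numpy as np


def is_MISSING(arr):
    """Boolean array: True where the stored value is NaN (MISSING stand-in)."""
    return np.isnan(np.ma.getdata(arr))


def is_IGNORED(arr):
    """Boolean array: True where the element is masked (IGNORED stand-in)."""
    return np.ma.getmaskarray(arr)


def is_MISSING_or_IGNORED(arr):
    """Boolean array: True where either condition holds."""
    return is_MISSING(arr) | is_IGNORED(arr)


example = np.ma.masked_array(
    [1.0, np.nan, 3.0, 4.0],
    mask=[False, False, True, False],
)

missing_flags = is_MISSING(example)
ignored_flags = is_IGNORED(example)
either_flags = is_MISSING_or_IGNORED(example)
```

Because all three functions return plain boolean arrays of the same shape, a third-party module can combine them with ordinary logical operators, which is the "easily combinable" property asked for above.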
Cheers! Ben Root
participants (1)
-
Laurent Gautier