Re: [Numpy-discussion] in the NA discussion, what can we agree on?

3 Nov 2011

      On 2011-11-03 04:22, numpy-discussion-request@scipy.org wrote:
...
Message: 1
Date: Wed, 2 Nov 2011 22:20:15 -0500
From: Benjamin Root
Subject: Re: [Numpy-discussion] in the NA discussion, what can we
  agree on?
To: Discussion of Numerical Python
Message-ID:

Content-Type: text/plain; charset="iso-8859-1"
On Wednesday, November 2, 2011, Nathaniel Smith  wrote:
...
Hi Benjamin,
On Wed, Nov 2, 2011 at 5:25 PM, Benjamin Root  wrote:
...
I want to pare this down even more.  I think the above lists make too
many unneeded extrapolations.
Okay. I found your formatting a little confusing, so I want to make
sure I understood the changes you're suggesting:
For the description of what MISSING means, you removed the lines:
- Compatibility with R is valuable
- To avoid user confusion, ideally it should *not* be possible to
'unmask' a missing value, since this is inconsistent with the "missing
value" metaphor (e.g., see Wes's comment about "leaky abstractions")
And you added the line:
+ Assigning MISSING is destructive
And for the description of what IGNORED means, you removed the lines:
- Some memory overhead is inevitable and acceptable
- Compatibility with R neither possible nor valuable
- Ability to toggle the IGNORED state of a location is critical, and
should be as convenient as possible
And you added the lines:
+ IGNORE is non-destructive
+ Must be competitive with np.ma for speed and memory (or else users
would just use np.ma)
Is that right?
Correct.
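To make the destructive / non-destructive distinction above concrete, here is a minimal sketch using only what NumPy offers today. It assumes NaN as a stand-in for MISSING and an np.ma mask as a stand-in for IGNORED; neither is a design proposed in this thread, and the variable names are purely illustrative.

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0])

# MISSING-style (destructive): the payload is overwritten, so the
# original 2.0 cannot be recovered from `missing` afterwards.
# NaN is only an assumed stand-in for a real MISSING marker.
missing = values.copy()
missing[1] = np.nan

# IGNORED-style (non-destructive): the mask hides the element, but the
# stored value is still there and reappears once the mask is cleared.
ignored = np.ma.masked_array(values.copy(), mask=[False, True, False])
sum_while_ignored = ignored.sum()      # the masked element is skipped

ignored.mask = [False, False, False]   # toggle the IGNORED state off
sum_after_unmask = ignored.sum()       # the original 2.0 is back
```

The point is only that the first operation loses information while the second does not; how either would actually be implemented is left open, as in the lists above.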
...
Assuming it is, my thoughts are:
By R compatibility, I specifically had in mind in-memory
compatibility. rpy2 provides a more-or-less seamless within-process
interface between R and Python (and specifically lets you get numpy
views on arrays returned by R functions), so if we can make this work
for R arrays containing NA too then that'd be handy. (The rpy2 author
requested this in the last discussion here:
http://mail.scipy.org/pipermail/numpy-discussion/2011-June/057084.html)
When it comes to disk formats, then this doesn't matter so much, since
IO routines have to translate between different representations all
the time anyway.
Interesting, but I still have to wonder if that should be on the wishlist
for MISSING.  I guess it depends on whether people would be fully
converting from R or gradually transitioning from it?  That is
something that I can't answer.
I probably do not have all possible use-cases, but what I'd think of as
the most common is: use R stuff straight out of R from Python. Say
that you are doing your work in Python and read about some statistical
method for which an implementation exists in R (but not in
Python/numpy). You can just pass your numpy arrays or vectors to the
relevant R function(s) and retrieve the results in a form directly
usable by numpy (without having the data copied around). Should
performance become an issue, and that method be of crucial importance,
you will probably want to reimplement it (in C or Cython, for example).
Otherwise you could pick from R's phenomenal toolbox without much effort
and keep those calls to R as part of your code.

In my experience, the latter would be the most frequent.

Get some compatibility for the NA "magic" values, and that possible
coupling between R and numpy becomes even better, because it prevents
either side from interpreting them as non-NA values.
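As a rough illustration of what such a "magic" value looks like from the numpy side: to my knowledge R encodes a floating-point NA as a NaN whose low 32-bit word is 1954, which numpy would otherwise see as just another NaN. The sketch below assumes that bit pattern (it comes from R's implementation, not from anything agreed in this thread); the helper names r_na_array and is_r_na are hypothetical.

```python
import numpy as np

# Assumed bit pattern of R's NA_real_: a NaN whose low word is 1954.
R_NA_LOW_WORD = 1954
R_NA_BITS = np.uint64(0x7FF00000000007A2)


def r_na_array(n):
    """Return n float64 values carrying the assumed R NA bit pattern."""
    return np.full(n, R_NA_BITS, dtype=np.uint64).view(np.float64)


def is_r_na(values):
    """True where a float64 value is a NaN whose low word is 1954."""
    values = np.ascontiguousarray(values, dtype=np.float64)
    low_word = values.view(np.uint64) & np.uint64(0xFFFFFFFF)
    return np.isnan(values) & (low_word == R_NA_LOW_WORD)


# An ordinary NaN and an R-style NA side by side: np.isnan cannot tell
# them apart, but the payload check can.
sample = np.concatenate([np.array([1.0, np.nan]), r_na_array(1)])
na_flags = is_r_na(sample)
nan_flags = np.isnan(sample)
```

Without a check of this kind, an NA coming back from R is indistinguishable from a plain NaN on the numpy side.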
...
...
I take the replacement of my line about MISSING disallowing unmasking
and your line about MISSING assignment being destructive as basically
expressing the same idea. Is that fair, or did you mean something
else?
I am someone who wants to get to the absolute core of ideas. Also, this
expression cleanly delineates the differences as binary.
By expressing it this way, we also shy away from implementation details.
For example, unmasking can be programmatically prevented for MISSING while
it could be implemented by other indirect means for IGNORE. Not that those
are the preferred ways, only that the phrasing is more flexible and
exacting.
...
Finally, do you think that people who want IGNORED support care about
having a convenient API for masking/unmasking values? You removed that
line, but I don't know if that was because you disagreed with it, or
were just trying to simplify.
See previous.
...
...
Then, as a third-party module developer, I can tell you that having
separate and independent ways to detect "MISSING"/"IGNORED" would likely
make support more difficult; support would greatly benefit from a common
(or easily combinable) method of identification.
Right, sorry... I didn't forget, and that's part of what I was
thinking when I described the second approach as keeping them as
*mostly*-separate interfaces... but I should have made it more
explicit! Anyway, yes:
4) There is consensus that whatever approach is taken, there should be
a quick and convenient way to identify values that are MISSING,
IGNORED, or both. (E.g., functions is_MISSING, is_IGNORED,
is_MISSING_or_IGNORED, or some equivalent.)
Good.
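For what point 4 could look like in practice, here is a minimal sketch built on existing NumPy pieces. It assumes MISSING is encoded as NaN in the data and IGNORED as an np.ma mask; the three function names are hypothetical spellings of the is_MISSING / is_IGNORED / is_MISSING_or_IGNORED idea and do not exist in NumPy.

```python
import numpy as np


def is_missing(arr):
    """True where the stored value is NaN (assumed MISSING encoding)."""
    return np.isnan(np.ma.getdata(arr))


def is_ignored(arr):
    """True where the np.ma mask is set (assumed IGNORED encoding)."""
    return np.ma.getmaskarray(arr)


def is_missing_or_ignored(arr):
    """Combined test, so third-party code needs only one call."""
    return is_missing(arr) | is_ignored(arr)


arr = np.ma.masked_array(
    [1.0, np.nan, 3.0, np.nan],
    mask=[False, False, True, True],
)
missing_flags = is_missing(arr)
ignored_flags = is_ignored(arr)
either_flags = is_missing_or_ignored(arr)
```

All three functions also accept a plain ndarray, in which case is_ignored simply returns all False.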
Cheers!
Ben Root

Laurent Gautier
