[Numpy-discussion] What is consensus anyway

josef.pktd at gmail.com josef.pktd at gmail.com
Tue Apr 24 15:19:47 EDT 2012


On Tue, Apr 24, 2012 at 2:35 PM, Benjamin Root <ben.root at ou.edu> wrote:
> On Tue, Apr 24, 2012 at 2:12 PM, Charles R Harris
> <charlesr.harris at gmail.com> wrote:
>>
>>
>>
>> On Tue, Apr 24, 2012 at 9:25 AM, <josef.pktd at gmail.com> wrote:
>>>
>>> On Tue, Apr 24, 2012 at 9:43 AM, Pierre Haessig
>>> <pierre.haessig at crans.org> wrote:
>>> > Hi,
>>> >
>>> > On 24/04/2012 15:14, Charles R Harris wrote:
>>> >>
>>> >> a) All arrays should be implicitly masked, even if the mask isn't
>>> >> initially allocated. The maskna keyword can then be removed, taking
>>> >> with it the sense that there are two kinds of arrays.
>>> >>
>>> >
>>> > From my lazy user perspective, having masked and non-masked arrays
>>> > share
>>> > the same "look and feel" would be a number one advantage over the
>>> > existing numpy.ma arrays. I would like masked array to be as
>>> > transparent
>>> > as possible.
>>>
>>> I don't have any opinion about internal implementation.
>>>
>>> But users need to be aware of whether they have masked arrays or not,
>>> since many functions (most of scipy) wouldn't know how to handle NA
>>> and don't do any checks (and shouldn't, in my opinion, if the NA check
>>> is costly). The result might be silently wrong numbers, depending on
>>> the implementation.
>>
>>
>> There should be a flag saying whether or not NA has been allocated and
>> allocation happens when NA is assigned to an array item, so that should be
>> fast. I don't think scipy currently deals with masked arrays in all areas,
>> so I believe that the same problem exists there and would also exist for
>> missing data types. I think this sort of compatibility problem is worth a
>> whole discussion by itself.
>>
>>>
>>>
>>> >
>>> >> b) There needs to be a distinction between missing and ignore. The
>>> >> mechanism for this is already in place in the payload type, although
>>> >> it isn't clear to me that that is uniformly used in all the NA code.
>>> >> There is also a place for missing *and* ignored. Which leads to
>>> >
>>> > If the idea of having two payloads is to avoid as many "skipna &
>>> > friends" extra keywords as possible, I would like it very much. In my
>>> > limited experience with R, I end up calling every function with a
>>> > different magical set of keywords (na.rm, na.action, ... and others I
>>> > forgot).
>>>
>>> There is a reason for requiring the user to decide what to do about NA's.
>>> Either we have utility functions/methods to help the user change the
>>> arrays and treat NA's before calling a function, or the function needs
>>> to ask the user what should be done about possible NAs.
>>> Doing it automatically might only be useful for specialised packages.
>>>
>>
>> That's what the different payloads would do. I think the common use case
>> would always have the ignore bit set. What are the other sorts of actions
>> you are interested in, and should they be part of the functions in Numpy,
>> such as mean and std, or should they rather be implemented in stats packages
>> that may be more specialized? I see numpy.ma currently used in the following
>> spots in scipy:

I think most functions that operate on an axis have an unambiguous
"ignore" behavior: std, mean, var, and histogram should stay in numpy.
np.cov might have a pairwise or row/column-wise deletion option (but I
don't know what other packages are doing).
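
As a rough sketch of the two deletion options mentioned for np.cov,
using plain NaN as a stand-in for NA: the helper names cov_listwise and
cov_pairwise are hypothetical and not part of numpy, and np.cov itself
has no such option.

```python
import numpy as np

def cov_listwise(x):
    """Row-wise deletion: drop every observation (row) containing a NaN."""
    x = np.asarray(x, dtype=float)
    keep = ~np.isnan(x).any(axis=1)
    return np.cov(x[keep], rowvar=False)

def cov_pairwise(x):
    """Pairwise deletion: for each pair of columns, use all rows where
    both columns are observed."""
    x = np.asarray(x, dtype=float)
    k = x.shape[1]
    out = np.empty((k, k))
    for i in range(k):
        for j in range(i, k):
            keep = ~(np.isnan(x[:, i]) | np.isnan(x[:, j]))
            xi, xj = x[keep, i], x[keep, j]
            # Unbiased estimate (ddof=1) on the rows shared by this pair.
            c = ((xi - xi.mean()) * (xj - xj.mean())).sum() / (keep.sum() - 1)
            out[i, j] = out[j, i] = c
    return out

data = np.array([[1.0, 2.0, 3.0],
                 [2.0, 4.0, np.nan],
                 [3.0, 6.0, 1.0],
                 [4.0, 8.0, 5.0]])
cov_l = cov_listwise(data)   # uses rows 0, 2, 3 for every entry
cov_p = cov_pairwise(data)   # uses all 4 rows for the (0, 1) entry
```

The two estimates differ whenever a row is only partly missing, and
the pairwise result is not guaranteed to be positive semidefinite.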

(While I had to run off, Nathaniel explained this.)

The main cases in stats (or statsmodels) for handling NaNs or NAs
would be rowwise ignore, or temporarily pretending that they are zero
or some other neutral value.
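
A minimal illustration of those two cases, assuming missing values are
encoded as NaN in a float array (the example data and variable names
are made up):

```python
import numpy as np

x = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [4.0, 5.0]])

# Rowwise ignore: keep only observations with no missing entry.
complete = x[~np.isnan(x).any(axis=1)]

# Neutral value: temporarily treat NaN as 0 so it adds nothing to a sum.
filled = np.where(np.isnan(x), 0.0, x)
col_sums = filled.sum(axis=0)

# Dividing by the number of observed entries turns the sums into means
# over the non-missing values only.
n_valid = (~np.isnan(x)).sum(axis=0)
col_means = col_sums / n_valid
```

Zero is neutral for a sum but not for a product or a maximum, so the
fill value has to be chosen per operation.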

>>
>
> Like you said, this whole issue probably should be in a separate discussion,
> but I would like to share my thoughts on the default payload here. If
> we don't have some sort of mechanism for flagging which functions are
> NA-friendly or not, then it would be wise to have NA default to NaN
> behavior, if only to prevent bugs that corrupt data from going undetected.

In scipy.stats it's currently the responsibility of the user. Unless
it is explicitly mentioned that a function knows how to handle nans or
masked arrays, the default is "we don't check", and what you get
returned might be anything.

If there is a flag (and a cheap way to verify whether there are NaNs
or NAs), then we could just add a check in every function.
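
A sketch of what such a per-function check could look like today,
assuming NaN stands in for NA. The helper name check_no_nan is
hypothetical, and without the proposed flag it has to scan the whole
array, which is exactly the cost in question:

```python
import numpy as np

def check_no_nan(a, name="input"):
    """Return `a` as a float array, raising if it contains any NaN.

    Hypothetical helper: a full O(n) scan, since no cheap flag exists.
    """
    a = np.asarray(a, dtype=float)
    if np.isnan(a).any():
        raise ValueError("%s contains NaN values" % name)
    return a

def mean_checked(a):
    # Example of a function that refuses NaN instead of silently
    # returning NaN or a wrong number.
    return check_no_nan(a, "a").mean()
```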

Josef

>
> That being said, the determination of NA payload is tricky.  Some functions
> may need to react differently to an NA.  One that comes to mind is
> np.gradient().  However, other functions may not need to do anything because
> they depend entirely upon other functions that have already been updated to
> support NA.
>
> Cheers!
> Ben Root
>
>


