[Numpy-discussion] missing data discussion round 2

Thu Jun 30 12:42:38 EDT 2011

Hi,

On Thu, Jun 30, 2011 at 5:13 PM, Mark Wiebe <mwwiebe at gmail.com> wrote:
> On Thu, Jun 30, 2011 at 11:04 AM, Gary Strangman
> <strang at nmr.mgh.harvard.edu> wrote:
>>
>>>      Clearly there are some overlaps between what masked arrays are
>>>      trying to achieve and what R's NA mechanisms are trying to achieve.
>>>       Are they really similar enough that they should function using
>>>      the same API?
>>>
>>> Yes.
>>>
>>>      And if so, won't that be confusing?
>>>
>>> No, I don't believe so, any more than NA's in R, NaN's, or Inf's are
>>> already confusing.
>>
>> As one who's been silently following (most of) this thread, and a heavy R
>> and numpy user, perhaps I should chime in briefly here with a use case. I
>> more-or-less always work with partially masked data, like Matthew, but not
>> numpy masked arrays because the memory overhead is prohibitive. And, sad to
>> say, my experiments don't always go perfectly. I therefore have arrays in
>> which there is /both/ (1) data that is simply missing (np.NA?)--it never had
>> a value and never will--as well as simultaneously (2) data that is
>> temporarily masked (np.IGNORE? np.MASKED?) where I want to mask/unmask
>> different portions for different purposes/analyses. I consider these two
>> separate, completely independent issues and I unfortunately currently have
>> to kluge a lot to handle this.
>>
>> Concretely, consider a list of 100,000 observations (rows), with 12
>> measures per observation-row (a 100,000 x 12 array). Every now and then,
>> sprinkled throughout this array, I have missing values (someone didn't
>> answer a question, or a computer failed to record a response, or whatever).
>> For some analyses I want to mask the whole row (e.g., complete-case
>> analysis), leaving me with array entries that should be tagged with all 4
>> possible labels:
>>
>> 1) not masked, not missing
>> 2) masked, not missing
>> 3) not masked, missing
>> 4) masked, missing
>>
>> Obviously #4 is "overkill" ... but only until I want to unmask that row.
>> At that point, I need to be sure that missing values remain missing when
>> unmasked. Can a single API really handle this?
>
> The single API does support a masked array with an NA dtype, and the
> behavior in this case will be that the value is considered NA if either it
> is masked or the value is the NA bit pattern. So you could add a mask to an
> array with an NA dtype to temporarily treat the data as if more values were
> missing.

Right - but I think the separated API is cleaner and easier to
explain.  Do you disagree?
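
To make "separated" concrete, here is a rough sketch using only NumPy as
it exists today.  It is not the proposed API: NaN stands in for a
hypothetical np.NA, a plain boolean array stands in for the mask, and
the helper column_means is a name I made up for illustration.  It uses a
tiny version of Gary's table and shows that missing values stay missing
after the rows are unmasked.

```python
import numpy as np

# Tiny stand-in for the 100,000 x 12 table: 4 observations, 3 measures.
# Assumption: NaN plays the role of a hypothetical np.NA (never had a value).
data = np.array([
    [1.0, 2.0, 3.0],
    [4.0, np.nan, 6.0],
    [7.0, 8.0, 9.0],
    [np.nan, 11.0, 12.0],
])

# Missingness is a property of the data and is never toggled.
missing = np.isnan(data)

# The mask is a separate, temporary choice.  Here: complete-case analysis,
# so every row containing at least one missing value is masked as a whole.
mask = np.zeros(data.shape, dtype=bool)
mask[missing.any(axis=1), :] = True


def column_means(data, missing, mask):
    """Hypothetical helper: per-column mean over entries that are
    neither missing nor masked."""
    ok = ~(missing | mask)
    return np.array([data[ok[:, j], j].mean() for j in range(data.shape[1])])


# Entries that are both masked and missing (Gary's label 4).
n_masked_and_missing = int((mask & missing).sum())

# Complete-case means: only rows 0 and 2 contribute.
masked_means = column_means(data, missing, mask)

# Unmask everything; the missing entries are still excluded.
mask[:] = False
unmasked_means = column_means(data, missing, mask)
```

The point is only that the mask can be set and cleared freely without
touching the record of which values are missing.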

> One important reason I'm doing it this way is so that each NumPy algorithm
> and any 3rd party code only needs to be updated once to support both forms
> of missing data.

Could you explain what you mean?  Maybe a couple of examples?

Whatever API results, it will surely be with us for a long time, and
so it would be good to make sure we have the right one even if it
costs a bit more to update current code.

Cheers,

Matthew