[Numpy-discussion] missing data discussion round 2
Gary Strangman
strang at nmr.mgh.harvard.edu
Thu Jun 30 12:04:12 EDT 2011
> Clearly there are some overlaps between what masked arrays are
> trying to achieve and what R's NA mechanisms are trying to achieve.
> Are they really similar enough that they should function using
> the same API?
>
> Yes.
>
> And if so, won't that be confusing?
>
> No, I don't believe so, any more than NA's in R, NaN's, or Inf's are already
> confusing.
As one who's been silently following (most of) this thread, and a heavy R
and numpy user, perhaps I should chime in briefly here with a use case. I
more-or-less always work with partially masked data, like Matthew, but not
numpy masked arrays because the memory overhead is prohibitive. And, sad
to say, my experiments don't always go perfectly. I therefore have arrays
in which there is /both/ (1) data that is simply missing (np.NA?)--it
never had a value and never will--as well as simultaneously (2) data that
is temporarily masked (np.IGNORE? np.MASKED?) where I want to
mask/unmask different portions for different purposes/analyses. I consider
these two separate, completely independent issues and I unfortunately
currently have to kluge a lot to handle this.
Concretely, consider a list of 100,000 observations (rows), with 12
measures per observation-row (a 100,000 x 12 array). Every now and then,
sprinkled throughout this array, I have missing values (someone didn't
answer a question, or a computer failed to record a response, or
whatever). For some analyses I want to mask the whole row (e.g.,
complete-case analysis), leaving me with array entries that should be
tagged with all 4 possible labels:
1) not masked, not missing
2) masked, not missing
3) not masked, missing
4) masked, missing
Obviously #4 is "overkill" ... but only until I want to unmask that row.
At that point, I need to be sure that missing values remain missing when
unmasked. Can a single API really handle this?
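To make the four labels concrete, here is a rough sketch of the kind of bookkeeping this use case implies, assuming two independent boolean arrays alongside the data. The names (missing, masked, state_counts) are hypothetical and are not a proposed or existing numpy API, NaN stands in for a "missing" value only for illustration, and the array is scaled down from 100,000 x 12 so it runs quickly.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows, n_cols = 1000, 12  # scaled down from 100,000 x 12
data = rng.normal(size=(n_rows, n_cols))

# Permanent state (hypothetical name): values that never existed.
# NaN is used here only as a stand-in for a true "missing" marker.
missing = rng.random((n_rows, n_cols)) < 0.01
data[missing] = np.nan

# Temporary state (hypothetical name): starts fully unmasked.
masked = np.zeros_like(missing)


def state_counts(missing, masked):
    """Count array entries in each of the four possible states."""
    return {
        "not masked, not missing": int((~masked & ~missing).sum()),
        "masked, not missing": int((masked & ~missing).sum()),
        "not masked, missing": int((~masked & missing).sum()),
        "masked, missing": int((masked & missing).sum()),
    }


# Complete-case analysis: mask every row containing a missing value.
incomplete_rows = missing.any(axis=1)
masked[incomplete_rows, :] = True
before = state_counts(missing, masked)

# Analyse only entries that are neither masked nor missing.
usable = ~masked & ~missing
col_means = np.nanmean(np.where(usable, data, np.nan), axis=0)

# Unmask everything: the missing entries must still be missing.
masked[:] = False
after = state_counts(missing, masked)
assert np.array_equal(np.isnan(data), missing)
```

Because the two arrays are never combined into one flag, unmasking cannot turn a missing entry into a valid one. Two full boolean arrays cost one byte per entry each, so this sketch illustrates the semantics only and does not address the memory overhead.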
-best
Gary