[Numpy-discussion] Missing data wrap-up and request for comments

Nathaniel Smith njs at pobox.com
Wed May 9 19:08:55 EDT 2012


Hi Dag,

On Wed, May 9, 2012 at 8:44 PM, Dag Sverre Seljebotn
<d.s.seljebotn at astro.uio.no> wrote:
> I'm a heavy user of masks, which are used to make data NA in the
> statistical sense. The setting is that we have to mask out the radiation
> coming from the Milky Way in full-sky images of the Cosmic Microwave
> Background. There's data, but we know we can't trust it, so we make it
> NA. But we also do play around with different masks.

Oh, this is great -- that means you're one of the users I wasn't
sure actually existed :-). Now I know!

> Today we keep the mask in a separate array, and to zero-mask we do
>
> masked_data = data * mask
>
> or
>
> masked_data = data.copy()
> masked_data[mask == 0] = np.nan # soon np.NA
>
> depending on the circumstances.
>
> Honestly, API-wise, this is as good as it gets for us. Nice and
> transparent, no new semantics to learn in the special case of masks.
>
> Now, this has performance issues: Lots of memory use, extra transfers
> over the memory bus.

Right -- this is a case where (in the NA-overview terminology) masked
storage+NA semantics would be useful.
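
To make the difference concrete, here is a minimal sketch of the two
approaches quoted above. The array values and the convention that
1 = trusted, 0 = masked out are assumptions made up for illustration,
and np.nan stands in for the not-yet-existing np.NA:

```python
import numpy as np

# Hypothetical data; mask convention assumed here: 1 = trusted, 0 = masked out.
data = np.array([1.0, 2.0, 3.0, 4.0])
mask = np.array([1, 0, 1, 1])

# Zero-masking: masked entries become 0 and still count in every reduction.
zero_masked = data * mask

# NaN-masking: masked entries become NaN, which propagates like an NA would.
nan_masked = data.copy()
nan_masked[mask == 0] = np.nan

mean_zero = zero_masked.mean()          # the zero is averaged in as real data
mean_propagated = nan_masked.mean()     # NaN propagates through the reduction
mean_skipped = np.nanmean(nan_masked)   # NaN entries are skipped explicitly
```

Both variants allocate a full-size temporary next to data, which is the
memory cost being discussed.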

> BUT, NumPy has that problem all over the place, even for "x + y + z"!
> Solving it in the special case of masks, by making a new API, seems a
> bit myopic to me.
>
> IMO, that's much better solved at the fundamental level. As an
> *illustration*:
>
> with np.lazy:
>     masked_data1 = data * mask1
>     masked_data2 = data * (mask1 | mask2)
>     masked_data3 = (x + y + z) * (mask1 & mask3)
>
> This would create three "generator arrays" that would zero-mask the
> arrays (and perform the three-term addition...) upon request. You could
> slice the generator arrays as you wish, and by that slice the data and
> the mask in one operation. Obviously this could handle NA-masking too.
>
> You can probably do this today with Theano and numexpr, and I think
> Travis mentioned that "generator arrays" are on his radar for core NumPy.

Implementing this today would require some black magic hacks, because
on entry/exit to the context manager you'd have to "reach up" into the
calling scope and replace all the ndarrays with LazyArrays and then
vice-versa. This is actually totally possible:
  https://gist.github.com/2347382
but I'm not sure I'd call it *wise*. (You could probably avoid the
truly horrible set_globals_dict part of that gist, though.) Still, it
might be fun to prototype...
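
For comparison, a rough sketch of the "generator array" idea without any
scope rewriting, where arrays are wrapped explicitly instead of via a
context manager. The Lazy class and its wrap() method are hypothetical
names invented for this illustration; they are not an existing NumPy,
Theano or numexpr API. The sketch assumes all operands have the same
shape (no broadcasting, no scalars):

```python
import operator

import numpy as np


class Lazy:
    """Hypothetical deferred elementwise expression over same-shape ndarrays."""

    def __init__(self, op, *args):
        self.op = op      # None marks a leaf that wraps a concrete array
        self.args = args

    @classmethod
    def wrap(cls, array):
        return cls(None, np.asarray(array))

    def _binary(self, op, other):
        if not isinstance(other, Lazy):
            other = Lazy.wrap(other)
        return Lazy(op, self, other)

    def __add__(self, other):
        return self._binary(operator.add, other)

    def __mul__(self, other):
        return self._binary(operator.mul, other)

    def __and__(self, other):
        return self._binary(operator.and_, other)

    def __or__(self, other):
        return self._binary(operator.or_, other)

    def __getitem__(self, index):
        # Slice the leaves first, then compute only on the requested part.
        if self.op is None:
            return self.args[0][index]
        return self.op(*(arg[index] for arg in self.args))


x = Lazy.wrap(np.arange(6.0))
y = Lazy.wrap(np.ones(6))
z = Lazy.wrap(np.full(6, 2.0))
mask1 = Lazy.wrap(np.array([1, 1, 0, 0, 1, 1]))
mask3 = Lazy.wrap(np.array([1, 0, 1, 0, 1, 0]))

masked = (x + y + z) * (mask1 & mask3)  # builds an expression, computes nothing
first_half = masked[:3]                 # evaluates elements 0..2 only
```

Slicing still creates temporaries, but only of the size of the requested
slice. Fusing the whole expression into a single loop is a separate
problem that this sketch does not attempt.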

> Point is, as a user, I'm with Travis in having masks support go hide in
> ndmasked; they solve too much of a special case in a way that is too
> particular.

Right, that's the concern.

Hypothetical question: are you actually saying that if you had both
bitpattern NAs and Travis' "ndmasked" object, you would still go ahead
and use the bitpattern NAs for this case, because of the conceptual
simplicity, ease of Cython/C compatibility, etc.?

-- Nathaniel


