[Numpy-discussion] Masking through generator arrays
Charles R Harris
charlesr.harris at gmail.com
Thu May 10 00:18:44 EDT 2012
On Wed, May 9, 2012 at 9:54 PM, Dag Sverre Seljebotn <
d.s.seljebotn at astro.uio.no> wrote:
> Sorry everyone for being so dense and contaminating that other thread.
> Here's a new thread where I can respond to Nathaniel's response.
> On 05/10/2012 01:08 AM, Nathaniel Smith wrote:
> > Hi Dag,
> > On Wed, May 9, 2012 at 8:44 PM, Dag Sverre Seljebotn
> > <d.s.seljebotn at astro.uio.no> wrote:
> >> I'm a heavy user of masks, which are used to make data NA in the
> >> statistical sense. The setting is that we have to mask out the radiation
> >> coming from the Milky Way in full-sky images of the Cosmic Microwave
> >> Background. There's data, but we know we can't trust it, so we make it
> >> NA. But we also do play around with different masks.
> > Oh, this is great -- that means you're one of the users that I wasn't
> > sure existed or not :-). Now I know!
> >> Today we keep the mask in a separate array, and to zero-mask we do
> >> masked_data = data * mask
> >> or
> >> masked_data = data.copy()
> >> masked_data[mask == 0] = np.nan # soon np.NA
> >> depending on the circumstances.
> >> Honestly, API-wise, this is as good as it gets for us. Nice and
> >> transparent, no new semantics to learn in the special case of masks.
> >> Now, this has performance issues: Lots of memory use, extra transfers
> >> over the memory bus.
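
For readers following along, here is a minimal runnable sketch of the two idioms quoted above. The small `data` and `mask` arrays are hypothetical stand-ins for a sky map and its mask, with 1 meaning "keep" and 0 meaning "reject". It also shows the memory cost: each idiom allocates a new array as large as the data.

```python
import numpy as np

# Hypothetical stand-ins for a sky map and its mask (1 = keep, 0 = reject).
data = np.array([1.0, 2.0, 3.0, 4.0])
mask = np.array([1, 0, 1, 0])

# Zero-masking: rejected pixels become 0.0.
zero_masked = data * mask

# NaN-masking: rejected pixels become NaN; the copy leaves `data` untouched.
nan_masked = data.copy()
nan_masked[mask == 0] = np.nan

# Both results are new arrays of the same size as the input, which is the
# extra memory use and memory-bus traffic mentioned above.
same_size = zero_masked.nbytes == data.nbytes == nan_masked.nbytes
```

Trying out several masks this way means one full-size temporary per mask.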
> > Right -- this is a case where (in the NA-overview terminology) masked
> > storage+NA semantics would be useful.
> >> BUT, NumPy has that problem all over the place, even for "x + y + z"!
> >> Solving it in the special case of masks, by making a new API, seems a
> >> bit myopic to me.
> >> IMO, that's much better solved at the fundamental level. As an
> >> *illustration*:
> >> with np.lazy:
> >> masked_data1 = data * mask1
> >> masked_data2 = data * (mask1 | mask2)
> >> masked_data3 = (x + y + z) * (mask1 & mask3)
> >> This would create three "generator arrays" that would zero-mask the
> >> arrays (and perform the three-term addition...) upon request. You could
> >> slice the generator arrays as you wish, and by that slice the data and
> >> the mask in one operation. Obviously this could handle NA-masking too.
> >> You can probably do this today with Theano and numexpr, and I think
> >> Travis mentioned that "generator arrays" are on his radar for core
> > Implementing this today would require some black magic hacks, because
> > on entry/exit to the context manager you'd have to "reach up" into the
> > calling scope and replace all the ndarrays with LazyArrays and then
> > vice-versa. This is actually totally possible:
> > https://gist.github.com/2347382
> > but I'm not sure I'd call it *wise*. (You could probably avoid the
> > truly horrible set_globals_dict part of that gist, though.) Might be
> > fun to prototype, though...
> 1) My main point was just that masked arrays feel immature to me,
> and that they are the kind of thing that should be
> constructed from simpler primitives. And that NumPy should focus on
> simple primitives. You could make it
I can't disagree, as I suggested the same as a possibility myself ;) There
is a lot of infrastructure now in numpy, but given the use cases I'm
tending towards the view that masked arrays should be left to others, at
least for the time being. The question is how to generalize the
infrastructure and what hooks to provide. I think just spending a month or
two pulling stuff out is counterproductive, but evolving the code is
definitely needed. If you could familiarize yourself with what is in there,
something that seems largely neglected by the critics, and make
suggestions, that would be helpful.
I'd also like to hear from Mark. It has been about 9 months since he did the
work, and I'd be surprised if he didn't have ideas for doing some things
differently. OTOH, I can understand his reluctance to get involved in a
topic where I thought he was poorly treated last time around.
> np.gen.generating_multiply(data, mask)
> 2) About the with construct in particular, I intended "__enter__" and
> "__exit__" to only toggle a thread-local flag, and when that flag is in
> effect, "__mul__" would do a "generating_multiply" and return an
> ndarraygenerator rather than an ndarray.
> But of course, the amount of work is massive.
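
The thread-local flag idea in point 2 can be sketched roughly as follows. This is only an illustration under stated assumptions, not an existing NumPy API: `lazy`, `GeneratingMultiply`, and `LazyAwareArray` are hypothetical names, and `np.lazy`, `generating_multiply`, and `ndarraygenerator` from the quoted text do not exist in NumPy. Because `ndarray.__mul__` cannot be patched from Python, the sketch uses an ndarray subclass, and it assumes both operands have the same shape (no broadcasting).

```python
import threading
from contextlib import contextmanager

import numpy as np

_state = threading.local()  # per-thread flag, as described in point 2


@contextmanager
def lazy():
    """Turn on lazy multiplication for the current thread only."""
    previous = getattr(_state, "lazy", False)
    _state.lazy = True
    try:
        yield
    finally:
        _state.lazy = previous


class GeneratingMultiply:
    """Deferred elementwise product; nothing is computed until requested."""

    def __init__(self, left, right):
        self.left = left
        self.right = right

    def __getitem__(self, index):
        # Slice both operands first, then multiply only the selected part.
        return np.asarray(self.left)[index] * np.asarray(self.right)[index]

    def compute(self):
        # Materialize the full product.
        return np.asarray(self.left) * np.asarray(self.right)


class LazyAwareArray(np.ndarray):
    """Hypothetical ndarray subclass whose `*` consults the flag."""

    def __mul__(self, other):
        if getattr(_state, "lazy", False):
            return GeneratingMultiply(self, other)
        return super().__mul__(other)


data = np.arange(6.0).view(LazyAwareArray)
mask = np.array([1, 0, 1, 0, 1, 0])

eager = data * mask          # ordinary multiplication, computed immediately
with lazy():
    deferred = data * mask   # a GeneratingMultiply; no product computed yet

part = deferred[2:4]         # slices data and mask in one operation
```

Only `__mul__` is intercepted here. Covering the remaining operators, ufuncs, and broadcasting is where the "massive" amount of work would go.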