[Numpy-discussion] Masking through generator arrays

Charles R Harris charlesr.harris at gmail.com
Thu May 10 04:40:59 EDT 2012


On Thu, May 10, 2012 at 1:10 AM, Dag Sverre Seljebotn <
d.s.seljebotn at astro.uio.no> wrote:

> On 05/10/2012 06:18 AM, Charles R Harris wrote:
> >
> >
> > On Wed, May 9, 2012 at 9:54 PM, Dag Sverre Seljebotn
> > <d.s.seljebotn at astro.uio.no> wrote:
> >
> >     Sorry everyone for being so dense and contaminating that other thread.
> >     Here's a new thread where I can respond to Nathaniel's response.
> >
> >     On 05/10/2012 01:08 AM, Nathaniel Smith wrote:
> >      > Hi Dag,
> >      >
> >      > On Wed, May 9, 2012 at 8:44 PM, Dag Sverre Seljebotn
> >      > <d.s.seljebotn at astro.uio.no> wrote:
> >      >> I'm a heavy user of masks, which are used to make data NA in the
> >      >> statistical sense. The setting is that we have to mask out the radiation
> >      >> coming from the Milky Way in full-sky images of the Cosmic Microwave
> >      >> Background. There's data, but we know we can't trust it, so we make it
> >      >> NA. But we also do play around with different masks.
> >      >
> >      > Oh, this is great -- that means you're one of the users that I wasn't
> >      > sure existed or not :-). Now I know!
> >      >
> >      >> Today we keep the mask in a separate array, and to zero-mask we do
> >      >>
> >      >> masked_data = data * mask
> >      >>
> >      >> or
> >      >>
> >      >> masked_data = data.copy()
> >      >> masked_data[mask == 0] = np.nan # soon np.NA
> >      >>
> >      >> depending on the circumstances.
> >      >>
> >      >> Honestly, API-wise, this is as good as it gets for us. Nice and
> >      >> transparent, no new semantics to learn in the special case of masks.
> >      >>
> >      >> Now, this has performance issues: Lots of memory use, extra transfers
> >      >> over the memory bus.
> >      >
> >      > Right -- this is a case where (in the NA-overview terminology) masked
> >      > storage+NA semantics would be useful.
> >      >
> >      >> BUT, NumPy has that problem all over the place, even for "x + y + z"!
> >      >> Solving it in the special case of masks, by making a new API, seems a
> >      >> bit myopic to me.
> >      >>
> >      >> IMO, that's much better solved at the fundamental level. As an
> >      >> *illustration*:
> >      >>
> >      >> with np.lazy:
> >      >>      masked_data1 = data * mask1
> >      >>      masked_data2 = data * (mask1 | mask2)
> >      >>      masked_data3 = (x + y + z) * (mask1 & mask3)
> >      >>
> >      >> This would create three "generator arrays" that would zero-mask the
> >      >> arrays (and perform the three-term addition...) upon request. You could
> >      >> slice the generator arrays as you wish, and by that slice the data and
> >      >> the mask in one operation. Obviously this could handle NA-masking too.
> >      >>
> >      >> You can probably do this today with Theano and numexpr, and I think
> >      >> Travis mentioned that "generator arrays" are on his radar for core NumPy.
> >      >
> >      > Implementing this today would require some black magic hacks, because
> >      > on entry/exit to the context manager you'd have to "reach up" into the
> >      > calling scope and replace all the ndarrays with LazyArrays and then
> >      > vice-versa. This is actually totally possible:
> >      > https://gist.github.com/2347382
> >      > but I'm not sure I'd call it *wise*. (You could probably avoid the
> >      > truly horrible set_globals_dict part of that gist, though.) Might be
> >      > fun to prototype, though...
> >
> >     1) My main point was just that I believe masked arrays are something that
> >     to me feels immature, and that it is the kind of thing that should be
> >     constructed from simpler primitives. And that NumPy should focus on
> >     simple primitives. You could make it
> >
> >
> > I can't disagree, as I suggested the same as a possibility myself ;)
> > There is a lot of infrastructure now in numpy, but given the use cases
> > I'm tending towards the view that masked arrays should be left to
> > others, at least for the time being. The question is how to generalize
> > the infrastructure and what hooks to provide. I think just spending a
> > month or two pulling stuff out is counterproductive, but evolving the
> > code is definitely needed. If you could familiarize yourself with what
> > is in there, something that seems largely neglected by the critics, and
> > make suggestions, that would be helpful.
>
> But how on earth can I make constructive criticisms about code when I
> don't know what the purpose of that code is supposed to be?
>

What do you mean? I thought the purpose was quite clearly laid out in the
NEP. But the implementation of that purpose required some infrastructure.
The point, I suppose, is for you to suggest what would serve your use case.


>
> Are you saying you agree that the masking aspect should be banned (or at
> least not "core"), and asking me to look at code from that perspective
> and comment on how to get there while keeping as much as possible of the
> rest? Would that really be helpful?
>

No, I don't agree that it should be banned, but your perspective seems to
be that it should be, so I ask you to determine what is worth keeping. We
can of course pull it all out and forget about the whole thing. But I'm
getting tired of people saying do this or that without making technical
suggestions that can be implemented, looking at the code, testing things,
and providing feedback. At a minimum, I expect you to have an idea of how
things *should* work and how to get there.

Chuck
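
For readers following the quoted "generator arrays" idea, here is a minimal
pure-Python sketch of the deferred evaluation Dag describes: an expression is
recorded rather than computed, and slicing it evaluates only the requested
elements of the data and the mask together. The names Lazy and leaf are
hypothetical and invented for this illustration; nothing here is NumPy API
(np.lazy in the quoted text is itself only an illustration), and plain lists
stand in for arrays.

```python
import operator


class Lazy:
    """Hypothetical deferred elementwise expression over equal-length lists."""

    def __init__(self, op, *args):
        self.op = op      # None marks a leaf that wraps concrete data
        self.args = args  # one list for a leaf, two Lazy operands otherwise

    @classmethod
    def leaf(cls, values):
        return cls(None, list(values))

    def _binary(self, other, op):
        # Record the operation instead of computing it.
        if not isinstance(other, Lazy):
            other = Lazy.leaf(other)
        return Lazy(op, self, other)

    def __add__(self, other):
        return self._binary(other, operator.add)

    def __mul__(self, other):
        return self._binary(other, operator.mul)

    def __and__(self, other):
        return self._binary(other, operator.and_)

    def __or__(self, other):
        return self._binary(other, operator.or_)

    def __getitem__(self, key):
        # Evaluation happens here, and only for the requested slice.
        if not isinstance(key, slice):
            raise TypeError("this sketch supports slice indexing only")
        if self.op is None:
            return self.args[0][key]
        left, right = (arg[key] for arg in self.args)
        return [self.op(a, b) for a, b in zip(left, right)]


data = Lazy.leaf([1.5, 2.0, 3.5, 4.0])
x = Lazy.leaf([1, 2, 3, 4])
y = Lazy.leaf([10, 20, 30, 40])
z = Lazy.leaf([100, 200, 300, 400])
mask1 = Lazy.leaf([1, 1, 0, 1])
mask2 = Lazy.leaf([0, 1, 1, 0])
mask3 = Lazy.leaf([1, 0, 0, 1])

# Nothing is computed on these three lines; each builds an expression tree.
masked_data1 = data * mask1
masked_data2 = data * (mask1 | mask2)
masked_data3 = (x + y + z) * (mask1 & mask3)

# Slicing triggers evaluation of just the requested elements.
first = masked_data1[:]
middle = masked_data2[1:3]
summed = masked_data3[:]
```

A real implementation would also have to handle broadcasting, dtypes, and
fusing the loop so that no per-node temporaries are allocated, which this
sketch does not attempt.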