[Numpy-discussion] Masking through generator arrays

Dag Sverre Seljebotn d.s.seljebotn at astro.uio.no
Thu May 10 05:38:19 EDT 2012


On 05/10/2012 10:40 AM, Charles R Harris wrote:
>
>
> On Thu, May 10, 2012 at 1:10 AM, Dag Sverre Seljebotn
> <d.s.seljebotn at astro.uio.no> wrote:
>
>     On 05/10/2012 06:18 AM, Charles R Harris wrote:
>      >
>      >
>      > On Wed, May 9, 2012 at 9:54 PM, Dag Sverre Seljebotn
>      > <d.s.seljebotn at astro.uio.no> wrote:
>      >
>      >     Sorry everyone for being so dense and contaminating that
>     other thread.
>      >     Here's a new thread where I can respond to Nathaniel's response.
>      >
>      >     On 05/10/2012 01:08 AM, Nathaniel Smith wrote:
>      > > Hi Dag,
>      > >
>      > > On Wed, May 9, 2012 at 8:44 PM, Dag Sverre Seljebotn
>      > > <d.s.seljebotn at astro.uio.no> wrote:
>      > >> I'm a heavy user of masks, which are used to make data NA in the
>      > >> statistical sense. The setting is that we have to mask out the
>      >     radiation
>      > >> coming from the Milky Way in full-sky images of the Cosmic
>     Microwave
>      > >> Background. There's data, but we know we can't trust it, so we
>      >     make it
>      > >> NA. But we also do play around with different masks.
>      > >
>      > > Oh, this is great -- that means you're one of the users that I
>     wasn't
>      > > sure existed or not :-). Now I know!
>      > >
>      > >> Today we keep the mask in a separate array, and to zero-mask we do
>      > >>
>      > >> masked_data = data * mask
>      > >>
>      > >> or
>      > >>
>      > >> masked_data = data.copy()
>      > >> masked_data[mask == 0] = np.nan # soon np.NA
>      > >>
>      > >> depending on the circumstances.
>      > >>
>      > >> Honestly, API-wise, this is as good as it gets for us. Nice and
>      > >> transparent, no new semantics to learn in the special case of
>     masks.
>      > >>
>      > >> Now, this has performance issues: Lots of memory use, extra
>      >     transfers
>      > >> over the memory bus.
>      > >
>      > > Right -- this is a case where (in the NA-overview terminology)
>     masked
>      > > storage+NA semantics would be useful.
>      > >
>      > >> BUT, NumPy has that problem all over the place, even for "x + y
>      >     + z"!
>      > >> Solving it in the special case of masks, by making a new API,
>      >     seems a
>      > >> bit myopic to me.
>      > >>
>      > >> IMO, that's much better solved at the fundamental level. As an
>      > >> *illustration*:
>      > >>
>      > >> with np.lazy:
>      > >>      masked_data1 = data * mask1
>      > >>      masked_data2 = data * (mask1 | mask2)
>      > >> masked_data3 = (x + y + z) * (mask1 & mask3)
>      > >>
>      > >> This would create three "generator arrays" that would
>     zero-mask the
>      > >> arrays (and perform the three-term addition...) upon request.
>      >     You could
>      > >> slice the generator arrays as you wish, and by that slice the
>      >     data and
>      > >> the mask in one operation. Obviously this could handle
>      >     NA-masking too.
>      > >>
>      > >> You can probably do this today with Theano and numexpr, and I
>     think
>      > >> Travis mentioned that "generator arrays" are on his radar for core
>      >     NumPy.
>      > >
>      > > Implementing this today would require some black magic hacks,
>     because
>      > > on entry/exit to the context manager you'd have to "reach up"
>      >     into the
>      > > calling scope and replace all the ndarray's with LazyArrays and
>     then
>      > > vice-versa. This is actually totally possible:
>      > > https://gist.github.com/2347382
>      > > but I'm not sure I'd call it *wise*. (You could probably avoid the
>      > > truly horrible set_globals_dict part of that gist, though.)
>     Might be
>      > > fun to prototype, though...
>      >
>      >     1) My main point was just that I believe masked arrays are
>     something that
>      >     to me feels immature, and that it is the kind of thing that
>     should be
>      >     constructed from simpler primitives. And that NumPy should
>     focus on
>      >     simple primitives. You could make it
>      >
>      >
>      > I can't disagree, as I suggested the same as a possibility myself ;)
>      > There is a lot of infrastructure now in numpy, but given the use
>     cases
>      > I'm tending towards the view that masked arrays should be left to
>      > others, at least for the time being. The question is how to
>     generalize
>      > the infrastructure and what hooks to provide. I think just spending a
>      > month or two pulling stuff out is counterproductive, but
>     evolving the
>      > code is definitely needed. If you could familiarize yourself with
>     what
>      > is in there, something that seems largely neglected by the
>     critics, and
>      > make suggestions, that would be helpful.
>
>     But how on earth can I make constructive criticisms about code when I
>     don't know what the purpose of that code is supposed to be?
>
>
> What do you mean? I thought the purpose was quite clearly laid out in
> the NEP. But the implementation of that purpose required some
> infrastructure. The point, I suppose, is for you to suggest what would
> serve your use case.

What would serve me? I use NumPy as a glorified "double*". And I'm 
probably going to continue doing so. I'd be happiest with the status quo 
of NumPy 1.6 + bugfixes.

The main reason I'm getting involved is that the direction NumPy is 
going has consequences for Cython, for Cython users, and for the Cython 
code I write myself.

So my main comment is that I'd like (by whatever means necessary) to 
have PyObject_TypeCheck fail whenever you can't sanely access 
PyArray_DATA or in other ways use the good old NumPy C API (for whatever 
reason, such as masks).

If PyObject_TypeCheck fails, existing C extension modules are 
forward-compatible, and so one doesn't have to go over old extension 
module code to add in a check for the presence of masks (or whatever) 
and raise an exception.

Suggestion: Use the PyArrayObject struct for both ndarray and ndmasked, 
so that a plain ndarray simply has NULL mask fields. The differing 
ob_type makes the two look distinct to Python and to C extensions, while 
internally NumPy could just check for a NULL mask field rather than the 
type, if you prefer that.

(When I say I like status quo: There are things I'd *like* to see 
changed in NumPy, but it's better phrased in terms of full 
reimplementation rather than iterative refinement, which probably means 
I should keep those thoughts to myself until I grow up. But until NumPy 
changes on a pretty fundamental level, all I want is my glorified 
"double*". I'm probably not a representative user.)

>     Are you saying you agree that the masking aspect should be banned (or at
>     least not "core"), and asking me to look at code from that perspective
>     and comment on how to get there while keeping as much as possible of the
>     rest? Would that really be helpful?
>
>
> No, I don't agree that it should be banned, but your perspective seems
> to be that it should be, so I ask you to determine what is worth
> keeping. We can of course pull it all out and forget about the whole
> thing. But I'm getting tired of people saying do this or that without
> making technical suggestions that can be implemented, looking at the
> code, testing things, and providing feedback. At a minimum, I expect you
> to have an idea of how things *should* work and how to get there.

Look, I'm no stranger to coding; I used to spend a significant amount 
of time on Cython, doing some pretty extensive refactorings of the code 
base that touched pretty much all of the code.

And on the Cython list, we almost always keep details of implementation 
second-order to the semantics we want. We do keep an eye on "how 
difficult it would be to implement" while discussing semantics, and 
sometimes have to say "we ideally want this long-term, what's the 
stopgap solution we can use until we can make it work the way we want". 
But it's nothing like what you suggest above. And we don't compromise 
short-term if better semantics are available long-term (e.g., recent 
thread on None-checking).

But talking about process is getting old, I guess.

(Of course there's plenty of other factors involved that could explain 
why the Cython/SymPy/Sage/etc. development communities are so good, and 
the NumPy one so... ehrmm...)

Dag
