[Numpy-discussion] Masking through generator arrays

Dag Sverre Seljebotn d.s.seljebotn at astro.uio.no
Thu May 10 05:46:21 EDT 2012


On 05/10/2012 11:38 AM, Dag Sverre Seljebotn wrote:
> On 05/10/2012 10:40 AM, Charles R Harris wrote:
>>
>>
>> On Thu, May 10, 2012 at 1:10 AM, Dag Sverre Seljebotn
>> <d.s.seljebotn at astro.uio.no>  wrote:
>>
>>      On 05/10/2012 06:18 AM, Charles R Harris wrote:
>>       >
>>       >
>>       >  On Wed, May 9, 2012 at 9:54 PM, Dag Sverre Seljebotn
>>       >  <d.s.seljebotn at astro.uio.no>  wrote:
>>       >
>>       >      Sorry everyone for being so dense and contaminating that
>>      other thread.
>>       >      Here's a new thread where I can respond to Nathaniel's response.
>>       >
>>       >      On 05/10/2012 01:08 AM, Nathaniel Smith wrote:
>>       >  >  Hi Dag,
>>       >  >
>>       >  >  On Wed, May 9, 2012 at 8:44 PM, Dag Sverre Seljebotn
>>       >  >  <d.s.seljebotn at astro.uio.no>
>>       >        wrote:
>>       >  >>  I'm a heavy user of masks, which are used to make data NA in the
>>       >  >>  statistical sense. The setting is that we have to mask out the
>>       >      radiation
>>       >  >>  coming from the Milky Way in full-sky images of the Cosmic
>>      Microwave
>>       >  >>  Background. There's data, but we know we can't trust it, so we
>>       >      make it
>>       >  >>  NA. But we also do play around with different masks.
>>       >  >
>>       >  >  Oh, this is great -- that means you're one of the users that I
>>      wasn't
>>       >  >  sure existed :-). Now I know!
>>       >  >
>>       >  >>  Today we keep the mask in a separate array, and to zero-mask we do
>>       >  >>
>>       >  >>  masked_data = data * mask
>>       >  >>
>>       >  >>  or
>>       >  >>
>>       >  >>  masked_data = data.copy()
>>       >  >>  masked_data[mask == 0] = np.nan # soon np.NA
>>       >  >>
>>       >  >>  depending on the circumstances.
>>       >  >>
>>       >  >>  Honestly, API-wise, this is as good as it gets for us. Nice and
>>       >  >>  transparent, no new semantics to learn in the special case of
>>      masks.
>>       >  >>
>>       >  >>  Now, this has performance issues: Lots of memory use, extra
>>       >      transfers
>>       >  >>  over the memory bus.
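[Editor's note: the two idioms quoted above can be written out directly. This is a minimal sketch with made-up array values, just to show where the extra allocations come from:]

```python
import numpy as np

# Illustrative stand-ins for a sky map and its mask (values made up)
data = np.array([1.0, 2.0, 3.0, 4.0])
mask = np.array([1, 0, 1, 1])  # 0 marks untrusted pixels

# Idiom 1: zero-masking -- allocates a full-size temporary for the product
masked_zero = data * mask

# Idiom 2: NA-masking -- full copy first, then overwrite the untrusted pixels
masked_na = data.copy()
masked_na[mask == 0] = np.nan

# Either way the whole data array is read and a same-size result is
# written, which is the memory-bus traffic being complained about.
```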
>>       >  >
>>       >  >  Right -- this is a case where (in the NA-overview terminology)
>>      masked
>>       >  >  storage+NA semantics would be useful.
>>       >  >
>>       >  >>  BUT, NumPy has that problem all over the place, even for "x + y
>>       >      + z"!
>>       >  >>  Solving it in the special case of masks, by making a new API,
>>       >      seems a
>>       >  >>  bit myopic to me.
>>       >  >>
>>       >  >>  IMO, that's much better solved at the fundamental level. As an
>>       >  >>  *illustration*:
>>       >  >>
>>       >  >>  with np.lazy:
>>       >  >>       masked_data1 = data * mask1
>>       >  >>       masked_data2 = data * (mask1 | mask2)
>>       >  >>  masked_data3 = (x + y + z) * (mask1 & mask3)
>>       >  >>
>>       >  >>  This would create three "generator arrays" that would
>>      zero-mask the
>>       >  >>  arrays (and perform the three-term addition...) upon request.
>>       >      You could
>>       >  >>  slice the generator arrays as you wish, and by that slice the
>>       >      data and
>>       >  >>  the mask in one operation. Obviously this could handle
>>       >      NA-masking too.
>>       >  >>
>>       >  >>  You can probably do this today with Theano and numexpr, and I
>>      think
>>       >  >>  Travis mentioned that "generator arrays" are on his radar for core
>>       >      NumPy.
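[Editor's note: `np.lazy` and "generator arrays" do not exist in NumPy; the quoted code is explicitly an illustration. A toy sketch of the deferred-evaluation idea, with all names hypothetical, might look like this:]

```python
import numpy as np

class LazyProduct:
    """Toy 'generator array': defers data * mask until indexed.

    Hypothetical sketch only -- not a real NumPy feature. The point is
    that slicing propagates to both operands, so only the requested
    region is ever evaluated and no full-size temporary is allocated.
    """
    def __init__(self, data, mask):
        self.data = data
        self.mask = mask

    def __getitem__(self, idx):
        # Slice data and mask together, then evaluate just that region.
        return self.data[idx] * self.mask[idx]

data = np.arange(6, dtype=float)          # [0. 1. 2. 3. 4. 5.]
mask = np.array([1, 1, 0, 1, 0, 1])
lazy = LazyProduct(data, mask)
region = lazy[2:5]                        # evaluates only three elements
```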
>>       >  >
>>       >  >  Implementing this today would require some black magic hacks,
>>      because
>>       >  >  on entry/exit to the context manager you'd have to "reach up"
>>       >      into the
>>       >  >  calling scope and replace all the ndarray's with LazyArrays and
>>      then
>>       >  >  vice-versa. This is actually totally possible:
>>       >  >  https://gist.github.com/2347382
>>       >  >  but I'm not sure I'd call it *wise*. (You could probably avoid the
>>       >  >  truly horrible set_globals_dict part of that gist, though.)
>>      Might be
>>       >  >  fun to prototype, though...
>>       >
>>       >      1) My main point was just that I believe masked arrays are
>>      something that
>>       >      to me feels immature, and that it is the kind of thing that
>>      should be
>>       >      constructed from simpler primitives. And that NumPy should
>>      focus on
>>       >      simple primitives. You could make it
>>       >
>>       >
>>       >  I can't disagree, as I suggested the same as a possibility myself ;)
>>       >  There is a lot of infrastructure now in numpy, but given the use
>>      cases
>>       >  I'm tending towards the view that masked arrays should be left to
>>       >  others, at least for the time being. The question is how to
>>      generalize
>>       >  the infrastructure and what hooks to provide. I think just spending a
>>       >  month or two pulling stuff out is counterproductive, but
>>      evolving the
>>       >  code is definitely needed. If you could familiarize yourself with
>>      what
>>       >  is in there, something that seems largely neglected by the
>>      critics, and
>>       >  make suggestions, that would be helpful.
>>
>>      But how on earth can I make constructive criticisms about code when I
>>      don't know what the purpose of that code is supposed to be?
>>
>>
>> What do you mean? I thought the purpose was quite clearly laid out in
>> the NEP. But the implementation of that purpose required some
>> infrastructure. The point, I suppose, is for you to suggest what would
>> serve your use case.
>
> What would serve me? I use NumPy as a glorified "double*". And I'm
> probably going to continue doing so. I'd be most happy by the status quo
> of NumPy 1.6 + bugfixes.
>
> Main reason I'm getting involved is that the direction NumPy is going
> has consequences for Cython and Cython users and the Cython code I write
> myself.
>
> So my main comment is that I'd like (by whatever means necessary) to
> have PyObject_TypeCheck fail whenever you can't sanely access
> PyArray_DATA or in other ways use the good old NumPy C API (for whatever
> reason, such as masks).
>
> If PyObject_TypeCheck fails, existing C extension modules are
> forward-compatible, and so one doesn't have to go over old extension
> module code to add in a check for the presence of masks (or whatever)
> and raise an exception.
>
> Suggestion: Use the PyArrayObject struct for both ndarray and ndmasked,
> so that ndarray has NULL mask fields. So ob_type makes them appear
> different to Python and C extensions, but internally in NumPy you could
> just check for a NULL mask field rather than the type, if you prefer that.
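[Editor's note: the suggestion is a type-level firewall: make ndmasked a distinct type that is *not* a subtype of ndarray, so a legacy extension's `PyObject_TypeCheck(obj, &PyArray_Type)` (which accepts subtypes) rejects masked arrays before it ever touches `PyArray_DATA`. A Python analogue, with the `NdMasked` name and `legacy_extension` helper purely hypothetical, could be sketched as:]

```python
import numpy as np

class NdMasked:
    """Hypothetical stand-in for the proposed 'ndmasked' type.

    Deliberately NOT a subclass of np.ndarray: since PyObject_TypeCheck
    accepts subtypes, forward compatibility requires ndmasked to be a
    sibling type, not a subtype, of ndarray.
    """
    def __init__(self, data, mask):
        self.data = data
        self.mask = mask

def legacy_extension(arr):
    """Stand-in for an old C extension that assumes plain PyArray_DATA."""
    if not isinstance(arr, np.ndarray):   # ~ PyObject_TypeCheck failing
        raise TypeError("expected a plain ndarray (no mask support)")
    return arr.sum()

plain = np.ones(3)
masked = NdMasked(np.ones(3), np.array([1, 0, 1]))

legacy_extension(plain)   # unchanged behaviour for existing arrays
# legacy_extension(masked) would raise TypeError: the old code refuses
# the masked array cleanly instead of silently ignoring the mask.
```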
>
> (When I say I like status quo: There are things I'd *like* to see
> changed in NumPy, but it's better phrased in terms of full
> reimplementation rather than iterative refinement, which probably means
> I should keep those thoughts to myself until I grow up. But until NumPy
> changes on a pretty fundamental level, all I want is my glorified
> "double*". I'm probably not a representative user.)
>
>>      Are you saying you agree that the masking aspect should be banned (or at
>>      least not "core"), and asking me to look at code from that perspective
>>      and comment on how to get there while keeping as much as possible of the
>>      rest? Would that really be helpful?
>>
>>
>> No, I don't agree that it should be banned, but your perspective seems
>> to be that it should be, so I ask you to determine what is worth
>> keeping. We can of course pull it all out and forget about the whole
>> thing. But I'm getting tired of people saying do this or that without
>> making technical suggestions that can be implemented, looking at the
>> code, testing things, and providing feedback. At a minimum, I expect you
>> to have an idea of how things *should* work and how to get there.
>
> Look, I'm used to coding; at least I used to spend a significant amount
> of time on Cython, doing some pretty extensive refactorings of the code
> base that touched pretty much all the code.
>
> And on the Cython list, we almost always keep details of implementation
> second-order to the semantics we want. We do keep an eye on "how
> difficult it would be to implement" while discussing semantics, and
> sometimes have to say "we ideally want this long-term, what's the
> stopgap solution we can use until we can make it work the way we want".
> But it's nothing like what you suggest above. And we don't compromise
> short-term if better semantics are available long-term (e.g., recent
> thread on None-checking).
>
> But talking about process is getting old, I guess.
>
> (Of course there's plenty of other factors involved that could explain
> why the Cython/SymPy/Sage/etc. development communities are so good, and
> the NumPy one so...ehrmm..)

I'm very sorry. I should have said "lists"! I don't really know anything 
about the NumPy developer community.

Anyway, I sense my best course of action will be to shut up.

Dag


