[Numpy-discussion] Masking through generator arrays

Dag Sverre Seljebotn d.s.seljebotn at astro.uio.no
Thu May 10 10:48:21 EDT 2012


On 05/10/2012 11:38 AM, Dag Sverre Seljebotn wrote:
> On 05/10/2012 10:40 AM, Charles R Harris wrote:
>>
>>
>> On Thu, May 10, 2012 at 1:10 AM, Dag Sverre Seljebotn
>> <d.s.seljebotn at astro.uio.no<mailto:d.s.seljebotn at astro.uio.no>>  wrote:
>>
>>      On 05/10/2012 06:18 AM, Charles R Harris wrote:
>>       >
>>       >
>>       >  On Wed, May 9, 2012 at 9:54 PM, Dag Sverre Seljebotn
>>       >  <d.s.seljebotn at astro.uio.no<mailto:d.s.seljebotn at astro.uio.no>
>>      <mailto:d.s.seljebotn at astro.uio.no
>>      <mailto:d.s.seljebotn at astro.uio.no>>>  wrote:
>>       >
>>       >      Sorry everyone for being so dense and contaminating that
>>      other thread.
>>       >      Here's a new thread where I can respond to Nathaniel's response.
>>       >
>>       >      On 05/10/2012 01:08 AM, Nathaniel Smith wrote:
>>       >  >  Hi Dag,
>>       >  >
>>       >  >  On Wed, May 9, 2012 at 8:44 PM, Dag Sverre Seljebotn
>>       >  >  <d.s.seljebotn at astro.uio.no<mailto:d.s.seljebotn at astro.uio.no>
>>      <mailto:d.s.seljebotn at astro.uio.no<mailto:d.s.seljebotn at astro.uio.no>>>
>>       >        wrote:
>>       >  >>  I'm a heavy user of masks, which are used to make data NA in the
>>       >  >>  statistical sense. The setting is that we have to mask out the
>>       >      radiation
>>       >  >>  coming from the Milky Way in full-sky images of the Cosmic
>>      Microwave
>>       >  >>  Background. There's data, but we know we can't trust it, so we
>>       >      make it
>>       >  >>  NA. But we also do play around with different masks.
>>       >  >
>>       >  >  Oh, this is great -- that means you're one of the users that I
>>      wasn't
>>       >  >  sure existed or not :-). Now I know!
>>       >  >
>>       >  >>  Today we keep the mask in a seperate array, and to zero-mask we do
>>       >  >>
>>       >  >>  masked_data = data * mask
>>       >  >>
>>       >  >>  or
>>       >  >>
>>       >  >>  masked_data = data.copy()
>>       >  >>  masked_data[mask == 0] = np.nan # soon np.NA
>>       >  >>
>>       >  >>  depending on the circumstances.
>>       >  >>
>>       >  >>  Honestly, API-wise, this is as good as its gets for us. Nice and
>>       >  >>  transparent, no new semantics to learn in the special case of
>>      masks.
>>       >  >>
>>       >  >>  Now, this has performance issues: Lots of memory use, extra
>>       >      transfers
>>       >  >>  over the memory bus.
>>       >  >
>>       >  >  Right -- this is a case where (in the NA-overview terminology)
>>      masked
>>       >  >  storage+NA semantics would be useful.
>>       >  >
>>       >  >>  BUT, NumPy has that problem all over the place, even for "x + y
>>       >      + z"!
>>       >  >>  Solving it in the special case of masks, by making a new API,
>>       >      seems a
>>       >  >>  bit myopic to me.
>>       >  >>
>>       >  >>  IMO, that's much better solved at the fundamental level. As an
>>       >  >>  *illustration*:
>>       >  >>
>>       >  >>  with np.lazy:
>>       >  >>       masked_data1 = data * mask1
>>       >  >>       masked_data2 = data * (mask1 | mask2)
>>       >  >>       masked_data3 = (x + y + z) * (mask1&   mask3)
>>       >  >>
>>       >  >>  This would create three "generator arrays" that would
>>      zero-mask the
>>       >  >>  arrays (and perform the three-term addition...) upon request.
>>       >      You could
>>       >  >>  slice the generator arrays as you wish, and by that slice the
>>       >      data and
>>       >  >>  the mask in one operation. Obviously this could handle
>>       >      NA-masking too.
>>       >  >>
>>       >  >>  You can probably do this today with Theano and numexpr, and I
>>      think
>>       >  >>  Travis mentioned that "generator arrays" are on his radar for core
>>       >      NumPy.
>>       >  >
>>       >  >  Implementing this today would require some black magic hacks,
>>      because
>>       >  >  on entry/exit to the context manager you'd have to "reach up"
>>       >      into the
>>       >  >  calling scope and replace all the ndarray's with LazyArrays and
>>      then
>>       >  >  vice-versa. This is actually totally possible:
>>       >  >  https://gist.github.com/2347382
>>       >  >  but I'm not sure I'd call it *wise*. (You could probably avoid the
>>       >  >  truly horrible set_globals_dict part of that gist, though.)
>>      Might be
>>       >  >  fun to prototype, though...
>>       >
>>       >      1) My main point was just that I believe masked arrays is
>>      something that
>>       >      to me feels immature, and that it is the kind of thing that
>>      should be
>>       >      constructed from simpler primitives. And that NumPy should
>>      focus on
>>       >      simple primitives. You could make it
>>       >
>>       >
>>       >  I can't disagree, as I suggested the same as a possibility myself ;)
>>       >  There is a lot of infrastructure now in numpy, but given the use
>>      cases
>>       >  I'm tending towards the view that masked arrays should be left to
>>       >  others, at least for the time being. The question is how to
>>      generalize
>>       >  the infrastructure and what hooks to provide. I think just spending a
>>       >  month or two pulling stuff out is counter productive, but
>>      evolving the
>>       >  code is definitely needed. If you could familiarize yourself with
>>      what
>>       >  is in there, something that seems largely neglected by the
>>      critics, and
>>       >  make suggestions, that would be helpful.
>>
>>      But how on earth can I make constructive criticisms about code when I
>>      don't know what the purpose of that code is supposed to be?
>>
>>
>> What do you mean? I thought the purpose was quite clearly laid out in
>> the NEP. But the implementation of that purpose required some
>> infrastructure. The point, I suppose, is for you to suggest what would
>> serve your use case.
>
> What would serve me? I use NumPy as a glorified "double*". And I'm
> probably going to continue doing so. I'd be most happy by the status quo
> of NumPy 1.6 + bugfixes.
>
> Main reason I'm getting involved is that the direction NumPy is going
> has consequences for Cython and Cython users and the Cython code I write
> myself.
>
> So my main comment is that I'd like (by whatever means necessary) to
> have PyObject_TypeCheck fail whenever you can't sanely access
> PyArray_DATA or in other ways use the good old NumPy C API (for whatever
> reason, such as masks).
>
> If PyObject_TypeCheck fails, existing C extension modules are
> forward-compatible, and so one doesn't have to go over old extension
> module code to add in a check for the presence of masks (or whatever)
> and raise an exception.
>
> Suggestion: Use the PyArrayObject struct for both ndarray and ndmasked,
> so that ndarray has NULL mask fields. So ob_type makes them appear
> different to Python and C extensions, but internally in NumPy you could
> just check for a NULL mask field rather than the type, if you prefer that.
>
> (When I say I like status quo: There are things I'd *like* to see
> changed in NumPy, but it's better phrased in terms of full
> reimplementation rather than iterative refinement, which probably means
> I should keep those thoughts to myself until I grow up. But until NumPy
> changes on a pretty fundamental level, all I want is my glorified
> "double*". I'm probably not a representative user.)
>
>>      Are you saying you agree that the masking aspect should be banned (or at
>>      least not "core"), and asking me to look at code from that perspective
>>      and comment on how to get there while keeping as much as possible of the
>>      rest? Would that really be helpful?
>>
>>
>> No, I don't agree that it should be banned, but your perspective seems
>> to be that it should be, so I ask you to determine what is worth
>> keeping. We can of course pull it all out and forget about the whole
>> thing. But I'm getting tired of people saying do this or that without
>> making technical suggestions that can be implemented, looking at the
>> code, testing things, and providing feedback. At a minimum, I expect you
>> to have an idea of how things *should* work and how to get there.

Chuck -- I apologize for my behaviour. I got aggravated by your reply, 
but that doesn't excuse the form of my answer. Let me try to answer again.

Listen: If you can explain to me *how* I would look at the source code 
-- what purpose I should have in mind while I do it -- then I'll do it.

Usually when I look at source code it's because I need to fix bug X, or 
implement feature Y. I have a clear purpose for opening up my editor.

But it's unclear to me how me simply looking at source code would help 
inform the discussion we're having here. If you can explain that to me 
(please spoon-feed me), I'm happy.

I could go through just to see how to make it how I'd personally prefer 
it, without regard to the opinion of others. But with overwhelming 
probability, what would come out of that is just what I've already said 
("rip masks out"), just stated in a more verbose form (the steps needed 
to rip them out). I don't see how that would help anybody. (Is it just 
that you disbelieve my future-looking statement here?)

There are other usecases than mine. But reading the source code doesn't 
inform me about those usecases. So either somebody else than me must 
comment on the source code, or I must gain a much better understanding 
of the motivation for masks than what I have today.

If you still want me to do what you say, just give me a better 
understanding of how I go about it, and that it'll be a valuable thing 
for NumPy that I do it, and I'll set aside some hours.

Dag



More information about the NumPy-Discussion mailing list