[Numpy-discussion] Masking through generator arrays
Dag Sverre Seljebotn
d.s.seljebotn at astro.uio.no
Wed May 9 23:54:11 EDT 2012
Sorry everyone for being so dense and contaminating that other thread.
Here's a new thread where I can respond to Nathaniel's response.
On 05/10/2012 01:08 AM, Nathaniel Smith wrote:
> Hi Dag,
>
> On Wed, May 9, 2012 at 8:44 PM, Dag Sverre Seljebotn
> <d.s.seljebotn at astro.uio.no> wrote:
>> I'm a heavy user of masks, which are used to make data NA in the
>> statistical sense. The setting is that we have to mask out the radiation
>> coming from the Milky Way in full-sky images of the Cosmic Microwave
>> Background. There's data, but we know we can't trust it, so we make it
>> NA. But we also do play around with different masks.
>
> Oh, this is great -- that means you're one of the users that I wasn't
> sure existed or not :-). Now I know!
>
>> Today we keep the mask in a separate array, and to zero-mask we do
>>
>> masked_data = data * mask
>>
>> or
>>
>> masked_data = data.copy()
>> masked_data[mask == 0] = np.nan # soon np.NA
>>
>> depending on the circumstances.
>>
>> Honestly, API-wise, this is as good as it gets for us. Nice and
>> transparent, no new semantics to learn in the special case of masks.
>>
>> Now, this has performance issues: Lots of memory use, extra transfers
>> over the memory bus.
>
> Right -- this is a case where (in the NA-overview terminology) masked
> storage+NA semantics would be useful.
>
>> BUT, NumPy has that problem all over the place, even for "x + y + z"!
>> Solving it in the special case of masks, by making a new API, seems a
>> bit myopic to me.
>>
>> IMO, that's much better solved at the fundamental level. As an
>> *illustration*:
>>
>> with np.lazy:
>>     masked_data1 = data * mask1
>>     masked_data2 = data * (mask1 | mask2)
>>     masked_data3 = (x + y + z) * (mask1 & mask3)
>>
>> This would create three "generator arrays" that would zero-mask the
>> arrays (and perform the three-term addition...) upon request. You could
>> slice the generator arrays as you wish, and by that slice the data and
>> the mask in one operation. Obviously this could handle NA-masking too.
>>
>> You can probably do this today with Theano and numexpr, and I think
>> Travis mentioned that "generator arrays" are on his radar for core
>> NumPy.
>
> Implementing this today would require some black magic hacks, because
> on entry/exit to the context manager you'd have to "reach up" into the
> calling scope and replace all the ndarrays with LazyArrays and then
> vice-versa. This is actually totally possible:
> https://gist.github.com/2347382
> but I'm not sure I'd call it *wise*. (You could probably avoid the
> truly horrible set_globals_dict part of that gist, though.) Might be
> fun to prototype, though...
1) My main point was just that masked arrays feel immature to me, and
that they are the kind of thing that should be constructed from simpler
primitives. NumPy should focus on simple primitives. You could make it

    np.gen.generating_multiply(data, mask)

2) About the with construct in particular, I intended "__enter__" and
"__exit__" to only toggle a thread-local flag; while that flag is in
effect, "__mul__" would do a "generating_multiply" and return an
ndarraygenerator rather than an ndarray.

But of course, the amount of work is massive.
>
>> Point is, as a user, I'm with Travis in having mask support go hide in
>> ndmasked; they solve too much of a special case in a way that is too
>> particular.
>
> Right, that's the concern.
>
> Hypothetical question: are you actually saying that if you had both
> bitpattern NAs and Travis' "ndmasked" object, you would still go ahead
> and use the bitpattern NAs for this case, because of the conceptual
> simplicity, ease of Cython/C compatibility, etc.?
For sure. But that's just one data point...
I'd either a) destroy the input data by overwriting it with NA, or b)
pass the mask separately.
However, I don't do much slicing. b) gets tiresome if you need to slice
and dice your arrays, and you don't have enough memory to do a). In that
case I might be tempted to use "the NEP", but I might also write my own
class containing a data array and a mask array that's purposed to the
task at hand... I don't know, since I don't do much slicing on the
arrays I happen to mask.
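
For what it's worth, the "own class" option could be as small as the
following sketch. The name MaskedPair and its methods are hypothetical,
and the convention that a true mask value means "keep this pixel" is an
assumption chosen to match the data * mask usage above.

```python
import numpy as np


class MaskedPair:
    """Hypothetical container: a data array plus a same-shaped mask."""

    def __init__(self, data, mask):
        self.data = np.asarray(data)
        self.mask = np.asarray(mask, dtype=bool)  # True means "trust this value"
        if self.data.shape != self.mask.shape:
            raise ValueError("data and mask must have the same shape")

    def __getitem__(self, idx):
        # Slice data and mask together; basic slices are views, not copies.
        return MaskedPair(self.data[idx], self.mask[idx])

    def filled(self, fill_value=np.nan):
        # Materialize a copy with the masked-out entries replaced.
        return np.where(self.mask, self.data, fill_value)


pair = MaskedPair(np.arange(6.0), [1, 0, 1, 1, 0, 1])
sub = pair[1:4]        # slices both arrays in one step
values = sub.filled()  # masked-out entries become NaN
```

The mask only costs memory once, and a copy is made only when filled() is
called on the slice actually needed.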
I've basically been wanting this issue to die as quickly as
possible, so that I could ignore it and the community could move on to
other issues. But now I think I've come around to a position where I
actually care that this doesn't make it into ndarray, in particular if the
intention is to put some pressure on C extension writers to support
it, rather than just saying that masked arrays don't work with most C
extensions.
Thanks a lot to Nathaniel, Matthew and others for taking up the fight.
Dag