[Numpy-discussion] Masking through generator arrays

Dag Sverre Seljebotn d.s.seljebotn at astro.uio.no
Wed May 9 23:54:11 EDT 2012


Sorry everyone for being so dense and contaminating that other thread. 
Here's a new thread where I can respond to Nathaniel's response.

On 05/10/2012 01:08 AM, Nathaniel Smith wrote:
 > Hi Dag,
 >
 > On Wed, May 9, 2012 at 8:44 PM, Dag Sverre Seljebotn
 > <d.s.seljebotn at astro.uio.no>  wrote:
 >> I'm a heavy user of masks, which are used to make data NA in the
 >> statistical sense. The setting is that we have to mask out the radiation
 >> coming from the Milky Way in full-sky images of the Cosmic Microwave
 >> Background. There's data, but we know we can't trust it, so we make it
 >> NA. But we also do play around with different masks.
 >
 > Oh, this is great -- that means you're one of the users that I wasn't
 > sure existed or not :-). Now I know!
 >
 >> Today we keep the mask in a separate array, and to zero-mask we do
 >>
 >> masked_data = data * mask
 >>
 >> or
 >>
 >> masked_data = data.copy()
 >> masked_data[mask == 0] = np.nan # soon np.NA
 >>
 >> depending on the circumstances.
 >>
 >> Honestly, API-wise, this is as good as it gets for us. Nice and
 >> transparent, no new semantics to learn in the special case of masks.
 >>
 >> Now, this has performance issues: Lots of memory use, extra transfers
 >> over the memory bus.
 >
 > Right -- this is a case where (in the NA-overview terminology) masked
 > storage+NA semantics would be useful.
 >
 >> BUT, NumPy has that problem all over the place, even for "x + y + z"!
 >> Solving it in the special case of masks, by making a new API, seems a
 >> bit myopic to me.
 >>
 >> IMO, that's much better solved at the fundamental level. As an
 >> *illustration*:
 >>
 >> with np.lazy:
 >>      masked_data1 = data * mask1
 >>      masked_data2 = data * (mask1 | mask2)
 >>      masked_data3 = (x + y + z) * (mask1 & mask3)
 >>
 >> This would create three "generator arrays" that would zero-mask the
 >> arrays (and perform the three-term addition...) upon request. You could
 >> slice the generator arrays as you wish, and by that slice the data and
 >> the mask in one operation. Obviously this could handle NA-masking too.
 >>
 >> You can probably do this today with Theano and numexpr, and I think
 >> Travis mentioned that "generator arrays" are on his radar for core
 >> NumPy.
 >
 > Implementing this today would require some black magic hacks, because
 > on entry/exit to the context manager you'd have to "reach up" into the
 > calling scope and replace all the ndarrays with LazyArrays and then
 > vice-versa. This is actually totally possible:
 >    https://gist.github.com/2347382
 > but I'm not sure I'd call it *wise*. (You could probably avoid the
 > truly horrible set_globals_dict part of that gist, though.) Might be
 > fun to prototype, though...

1) My main point was just that masked arrays feel immature to me, and 
that they are the kind of thing that should be constructed from simpler 
primitives. NumPy should focus on those simple primitives. You could 
make it

np.gen.generating_multiply(data, mask)
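
To make that concrete, here is a minimal sketch of what such a primitive 
might look like. Everything in it is hypothetical: NumPy has no "np.gen" 
namespace, and "GeneratedProduct" is a name invented for this 
illustration. It only shows the idea of deferring the multiply until the 
result is requested, and of slicing data and mask in one operation.

```python
import numpy as np


class GeneratedProduct:
    """Hypothetical lazy elementwise product; nothing is computed up front."""

    def __init__(self, a, b):
        # broadcast_arrays returns views, so no data is copied here.
        self.a, self.b = np.broadcast_arrays(a, b)

    @property
    def shape(self):
        return self.a.shape

    def __getitem__(self, idx):
        # Slicing the generator slices both operands, still lazily.
        return GeneratedProduct(self.a[idx], self.b[idx])

    def __array__(self, dtype=None, copy=None):
        # The multiply happens only when an ndarray is actually requested.
        out = self.a * self.b
        return out if dtype is None else out.astype(dtype)


def generating_multiply(data, mask):
    """Stand-in for the np.gen.generating_multiply mentioned above."""
    return GeneratedProduct(data, mask)


data = np.arange(6.0)
mask = np.array([1, 1, 0, 0, 1, 1])
lazy_product = generating_multiply(data, mask)
result = np.asarray(lazy_product[1:5])  # evaluates only the sliced part
```

A real implementation would also need to fuse longer expressions such as 
"x + y + z"; this sketch handles a single multiply only.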

2) About the with construct in particular, I intended "__enter__" and 
"__exit__" to only toggle a thread-local flag, and when that flag is in 
effect, "__mul__" would do a "generating_multiply" and return an 
ndarraygenerator rather than an ndarray.

But of course, the amount of work is massive.

 >
 >> Point is, as a user, I'm with Travis in having masks support go hide in
 >> ndmasked; they solve too much of a special case in a way that is too
 >> particular.
 >
 > Right, that's the concern.
 >
 > Hypothetical question: are you actually saying that if you had both
 > bitpattern NAs and Travis' "ndmasked" object, you would still go ahead
 > and use the bitpattern NAs for this case, because of the conceptual
 > simplicity, ease of Cython/C compatibility, etc.?

For sure. But that's just one data point...

I'd either a) destroy the input data by overwriting it with NA, or b) 
pass the mask separately.

However, I don't do much slicing. b) gets tiresome if you need to slice 
and dice your arrays, and you don't have enough memory to do a). In that 
case I might be tempted to use "the NEP", but I might also write my own 
class containing a data array and a mask array that's purposed to the 
task at hand... I don't know, since I don't do much slicing on the 
arrays I happen to mask.
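
For illustration, such a purpose-built class could be as small as the 
sketch below. The name "MaskedPair" and its methods are invented here, 
and the convention that a true mask value means "trust this pixel" is an 
assumption chosen to match the "mask == 0" usage earlier in the thread.

```python
import numpy as np


class MaskedPair:
    """Hypothetical container keeping a data array and its mask in step."""

    def __init__(self, data, mask):
        data = np.asarray(data)
        mask = np.asarray(mask, dtype=bool)  # True means the value is valid
        if data.shape != mask.shape:
            raise ValueError("data and mask must have the same shape")
        self.data, self.mask = data, mask

    def __getitem__(self, idx):
        # One indexing operation slices both arrays; basic slices are views.
        return MaskedPair(self.data[idx], self.mask[idx])

    def filled(self, fill_value=np.nan):
        # Materialise a copy with masked-out entries replaced.
        return np.where(self.mask, self.data, fill_value)


pair = MaskedPair([1.0, 2.0, 3.0, 4.0], [1, 0, 1, 1])
sub = pair[1:3]             # data and mask sliced together, no copy
zeroed = sub.filled(0.0)    # zero-masking happens only here
```

The input data is never overwritten, and the only extra memory is the 
temporary produced by "filled".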

I've basically been wanting this issue to die as quickly as possible, 
so that I could ignore it and the community could move on to other 
issues. But now I think I've come around to a position where I actually 
care that this doesn't make it into ndarray, in particular if the 
intention is to put some pressure on C extension writers to support 
this, rather than just saying that masked arrays don't work with most C 
extensions.

Thanks a lot to Nathaniel, Matthew and others for taking up the fight.

Dag
