[Numpy-discussion] Masking through generator arrays
Dag Sverre Seljebotn
d.s.seljebotn at astro.uio.no
Thu May 10 05:46:21 EDT 2012
On 05/10/2012 11:38 AM, Dag Sverre Seljebotn wrote:
> On 05/10/2012 10:40 AM, Charles R Harris wrote:
>>
>>
>> On Thu, May 10, 2012 at 1:10 AM, Dag Sverre Seljebotn
>> <d.s.seljebotn at astro.uio.no> wrote:
>>
>> On 05/10/2012 06:18 AM, Charles R Harris wrote:
>> >
>> >
>> > On Wed, May 9, 2012 at 9:54 PM, Dag Sverre Seljebotn
>> > <d.s.seljebotn at astro.uio.no> wrote:
>> >
>> > Sorry everyone for being so dense and contaminating that
>> other thread.
>> > Here's a new thread where I can respond to Nathaniel's response.
>> >
>> > On 05/10/2012 01:08 AM, Nathaniel Smith wrote:
>> > > Hi Dag,
>> > >
>> > > On Wed, May 9, 2012 at 8:44 PM, Dag Sverre Seljebotn
>> > > <d.s.seljebotn at astro.uio.no>
>> > wrote:
>> > >> I'm a heavy user of masks, which are used to make data NA in the
>> > >> statistical sense. The setting is that we have to mask out the
>> > radiation
>> > >> coming from the Milky Way in full-sky images of the Cosmic
>> Microwave
>> > >> Background. There's data, but we know we can't trust it, so we
>> > make it
>> > >> NA. But we also do play around with different masks.
>> > >
>> > > Oh, this is great -- that means you're one of the users I wasn't
>> > > sure existed :-). Now I know!
>> > >
>> > >> Today we keep the mask in a separate array, and to zero-mask we do
>> > >>
>> > >> masked_data = data * mask
>> > >>
>> > >> or
>> > >>
>> > >> masked_data = data.copy()
>> > >> masked_data[mask == 0] = np.nan # soon np.NA
>> > >>
>> > >> depending on the circumstances.
>> > >>
>> > >> Honestly, API-wise, this is as good as it gets for us. Nice and
>> > >> transparent, no new semantics to learn in the special case of
>> masks.
>> > >>
>> > >> Now, this has performance issues: Lots of memory use, extra
>> > transfers
>> > >> over the memory bus.
>> > >
>> > > Right -- this is a case where (in the NA-overview terminology)
>> masked
>> > > storage+NA semantics would be useful.
>> > >
>> > >> BUT, NumPy has that problem all over the place, even for "x + y
>> > + z"!
>> > >> Solving it in the special case of masks, by making a new API,
>> > seems a
>> > >> bit myopic to me.
>> > >>
>> > >> IMO, that's much better solved at the fundamental level. As an
>> > >> *illustration*:
>> > >>
>> > >> with np.lazy:
>> > >> masked_data1 = data * mask1
>> > >> masked_data2 = data * (mask1 | mask2)
>> > >> masked_data3 = (x + y + z) * (mask1 & mask3)
>> > >>
>> > >> This would create three "generator arrays" that would
>> zero-mask the
>> > >> arrays (and perform the three-term addition...) upon request.
>> > You could
>> > >> slice the generator arrays as you wish, and by that slice the
>> > data and
>> > >> the mask in one operation. Obviously this could handle
>> > NA-masking too.
>> > >>
>> > >> You can probably do this today with Theano and numexpr, and I
>> think
>> > >> Travis mentioned that "generator arrays" are on his radar for core
>> > NumPy.
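[Editorial note: the "generator array" idea above can be prototyped in a few lines of plain Python. `LazyArray` and its methods are invented names for illustration; a real implementation would live in C and handle broadcasting, dtypes, nested expressions, etc.]

```python
import numpy as np

class LazyArray:
    """Minimal sketch of a 'generator array': stores a closure and its
    operands instead of data, and evaluates only upon request."""
    def __init__(self, func, *operands):
        self.func = func
        self.operands = operands

    def __getitem__(self, index):
        # Slice the operands first, then compute: only the requested
        # piece of the data (and the mask) is ever materialized,
        # so data and mask are sliced in one operation.
        return self.func(*(op[index] for op in self.operands))

    def evaluate(self):
        # Materialize the whole result on demand.
        return self.func(*self.operands)

data = np.arange(6, dtype=float)
mask = np.array([1, 1, 0, 1, 0, 1])

masked = LazyArray(lambda d, m: d * m, data, mask)
part = masked[2:5]  # computes only elements 2..4
```

Slicing the `LazyArray` slices data and mask together, which is the one-operation behavior the message asks for; NA-masking would just be a different closure.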
>> > >
>> > > Implementing this today would require some black magic hacks,
>> because
>> > > on entry/exit to the context manager you'd have to "reach up"
>> > into the
>> > > calling scope and replace all the ndarray's with LazyArrays and
>> then
>> > > vice-versa. This is actually totally possible:
>> > > https://gist.github.com/2347382
>> > > but I'm not sure I'd call it *wise*. (You could probably avoid the
>> > > truly horrible set_globals_dict part of that gist, though.)
>> Might be
>> > > fun to prototype, though...
>> >
>> > 1) My main point was just that I believe masked arrays are
>> something that
>> > feels immature to me, and that it is the kind of thing that
>> should be
>> > constructed from simpler primitives. And that NumPy should
>> focus on
>> > simple primitives. You could make it
>> >
>> >
>> > I can't disagree, as I suggested the same as a possibility myself ;)
>> > There is a lot of infrastructure now in numpy, but given the use
>> cases
>> > I'm tending towards the view that masked arrays should be left to
>> > others, at least for the time being. The question is how to
>> generalize
>> > the infrastructure and what hooks to provide. I think just spending a
>> > month or two pulling stuff out is counterproductive, but
>> evolving the
>> > code is definitely needed. If you could familiarize yourself with
>> what
>> > is in there, something that seems largely neglected by the
>> critics, and
>> > make suggestions, that would be helpful.
>>
>> But how on earth can I make constructive criticisms about code when I
>> don't know what the purpose of that code is supposed to be?
>>
>>
>> What do you mean? I thought the purpose was quite clearly laid out in
>> the NEP. But the implementation of that purpose required some
>> infrastructure. The point, I suppose, is for you to suggest what would
>> serve your use case.
>
> What would serve me? I use NumPy as a glorified "double*". And I'm
> probably going to continue doing so. I'd be most happy by the status quo
> of NumPy 1.6 + bugfixes.
>
> Main reason I'm getting involved is that the direction NumPy is going
> has consequences for Cython and Cython users and the Cython code I write
> myself.
>
> So my main comment is that I'd like (by whatever means necessary) to
> have PyObject_TypeCheck fail whenever you can't sanely access
> PyArray_DATA or in other ways use the good old NumPy C API (for whatever
> reason, such as masks).
>
> If PyObject_TypeCheck fails, existing C extension modules are
> forward-compatible, and so one doesn't have to go over old extension
> module code to add in a check for the presence of masks (or whatever)
> and raise an exception.
>
> Suggestion: Use the PyArrayObject struct for both ndarray and ndmasked,
> so that ndarray has NULL mask fields. So ob_type makes them appear
> different to Python and C extensions, but internally in NumPy you could
> just check for a NULL mask field rather than the type, if you prefer that.
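[Editorial note: the forward-compatibility argument above can be illustrated at the Python level, where `isinstance` corresponds roughly to the C-level `PyObject_TypeCheck`. The `NDMasked` class here is a hypothetical stand-in for the proposed `ndmasked` type, invented purely for illustration.]

```python
import numpy as np

# Hypothetical stand-in for 'ndmasked': deliberately NOT a subclass of
# np.ndarray, mirroring the suggestion that PyObject_TypeCheck against
# PyArray_Type should fail for arrays whose data cannot be safely
# accessed through PyArray_DATA.
class NDMasked:
    def __init__(self, data, mask):
        self.data = np.asarray(data)
        self.mask = np.asarray(mask)   # would be a NULL field on plain ndarray

plain = np.arange(3)
masked = NDMasked([1.0, 2.0], [1, 0])

# An old extension module's type check: plain arrays pass and can be
# used via the classic C API; masked arrays are rejected cleanly
# instead of being silently misread.
old_extension_accepts = isinstance(plain, np.ndarray)
old_extension_rejects = not isinstance(masked, np.ndarray)
```

Because the check fails for the masked type, unmodified extension modules raise their usual "not an ndarray" error rather than reading masked data as if it were valid.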
>
> (When I say I like status quo: There are things I'd *like* to see
> changed in NumPy, but it's better phrased in terms of full
> reimplementation rather than iterative refinement, which probably means
> I should keep those thoughts to myself until I grow up. But until NumPy
> changes on a pretty fundamental level, all I want is my glorified
> "double*". I'm probably not a representative user.)
>
>> Are you saying you agree that the masking aspect should be banned (or at
>> least not "core"), and asking me to look at code from that perspective
>> and comment on how to get there while keeping as much as possible of the
>> rest? Would that really be helpful?
>>
>>
>> No, I don't agree that it should be banned, but your perspective seems
>> to be that it should be, so I ask you to determine what is worth
>> keeping. We can of course pull it all out and forget about the whole
>> thing. But I'm getting tired of people saying do this or that without
>> making technical suggestions that can be implemented, looking at the
>> code, testing things, and providing feedback. At a minimum, I expect you
>> to have an idea of how things *should* work and how to get there.
>
> Look, I'm used to coding; at least I used to spend a significant amount
> of time on Cython, doing some pretty extensive refactorings of the code
> base that touched pretty much all the code.
>
> And on the Cython list, we almost always keep details of implementation
> second-order to the semantics we want. We do keep an eye on "how
> difficult it would be to implement" while discussing semantics, and
> sometimes have to say "we ideally want this long-term, what's the
> stopgap solution we can use until we can make it work the way we want".
> But it's nothing like what you suggest above. And we don't compromise
> short-term if better semantics are available long-term (e.g., recent
> thread on None-checking).
>
> But talking about process is getting old, I guess.
>
> (Of course there's plenty of other factors involved that could explain
> why the Cython/SymPy/Sage/etc. development communities are so good, and
> the NumPy one so...ehrmm..)
I'm very sorry. I should have said "lists"! I don't really know anything
about the NumPy developer community.
Anyway, I sense my best course of action will be to shut up.
Dag