[Numpy-discussion] Missing/accumulating data

Joe Harrington jh at physics.ucf.edu
Mon Jul 4 00:03:29 EDT 2011


Christopher Barker, Ph.D. wrote:
> quick note on this: I like the "FALSE == good" way, because:

So you want multiple kinds of masked values, but I need multiple good
values for counts.  We could do it with negative masks and positive
counts, but that doesn't reduce to a boolean for whoever has the
negatives.
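
A minimal sketch of one reading of the signed scheme above, to show where
the boolean reduction breaks down.  The convention here (positive = count
of good samples, negative = mask code, zero = no data) and all names are
assumptions made for illustration, not an existing NumPy feature.

```python
import numpy as np

# Assumed convention (hypothetical):
#   value > 0  -> number of good samples accumulated in that element
#   value < 0  -> mask code saying why the element is bad
#   value == 0 -> no data at all
ancillary = np.array([3, 1, -2, 0, -1, 5], dtype=np.int32)

# Explicit comparisons recover the two meanings.
good = ancillary > 0
masked = ancillary < 0

# A plain cast to bool treats every nonzero value alike, so counts and
# mask codes can no longer be told apart.
naive = ancillary.astype(bool)

print(good)
print(masked)
print(naive)
```

Anyone relying on truthiness alone gets the `naive` array, which marks the
masked elements the same way as the counted ones; only the explicit
comparisons separate them.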

We could have separate arrays, one for masks and one for counts, with
both being optional.  That's harder to implement and may be slower,
but there's precedent: Spacecraft data are given to the investigator
with several images per "data collection event".  One is the actual
image, another is the uncertainties per pixel, a third and often
fourth are 32-bit bitmasks for error codes.  There can be a dozen of
these (raw data, permanently bad pixel mask, etc.).
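
The separate-arrays layout could be sketched roughly as follows.  The
class name `Exposure`, its fields, and the flag bits are all hypothetical,
chosen only to mirror the image / uncertainty / bitmask / count planes
described above; this is not an existing NumPy structure.

```python
from dataclasses import dataclass
from typing import Optional

import numpy as np

# Hypothetical error-code bits for a 32-bit flag plane.
BAD_PIXEL = np.uint32(1 << 0)
SATURATED = np.uint32(1 << 1)


@dataclass
class Exposure:
    """One data collection event: an image plus optional ancillary planes."""
    data: np.ndarray
    uncertainty: Optional[np.ndarray] = None
    flags: Optional[np.ndarray] = None    # uint32 bitmask, 0 means no error
    counts: Optional[np.ndarray] = None   # samples accumulated per pixel

    def good(self) -> np.ndarray:
        """Boolean array: True where no flag is set and the count is positive."""
        ok = np.ones(self.data.shape, dtype=bool)
        if self.flags is not None:
            ok &= self.flags == 0
        if self.counts is not None:
            ok &= self.counts > 0
        return ok


data = np.array([[10.0, 12.0], [11.0, 500.0]])
flags = np.zeros(data.shape, dtype=np.uint32)
flags[1, 1] |= SATURATED                  # mark one pixel as saturated
counts = np.array([[4, 4], [0, 4]])       # one pixel never observed

exposure = Exposure(data, flags=flags, counts=counts)
good = exposure.good()
mean_good = data[good].mean()
print(good)
print(mean_good)
```

Every plane stays an ordinary array that can be inspected and manipulated
directly, and either ancillary plane can be left out.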

Chuck Harris wrote:

> Array access needs to be distinguished from array exposure. If the access
> goes through getter/setter functions then the underlying representation can
> change. Whether or not that degree of abstraction is needed is another
> question, but it does make things more flexible.

Well, I've never been excited about data structures so complicated you
can't manipulate them directly.  In teaching about data analysis, we
work hard to teach students *not* to stuff things into black boxes and
ignore what's really going on.  Too much abstraction is hard to think
about, if you're used to dealing with data directly yourself.

Mark Wiebe wrote:

> The NA idea works with any dtype, like datetime, but 50% of a datetime isn't
> a reasonable concept, hurting the idea of general dtypes + alpha masking.

Yes, that is correct.  You shouldn't use an integer mask array with a
struct; you should use a boolean.  If you do use an int, you should
get an error.  I think the error would not cause too much confusion,
since it's obvious you shouldn't do that.
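
The rule could look something like the check below.  The function
`check_mask` is hypothetical (NumPy has no such call), and the choice of
which dtypes count as "weightable" is an assumption for the sake of the
sketch.

```python
import numpy as np


def check_mask(data, mask):
    """Hypothetical validator: accept a mask only if it suits data's dtype.

    Boolean masks are accepted for any dtype.  Non-boolean masks (counts,
    weights) are accepted only for plain numeric data, where a fractional
    or multiple contribution has a meaning.
    """
    mask = np.asarray(mask)
    if mask.shape != data.shape:
        raise ValueError("mask and data must have the same shape")
    if mask.dtype.kind == "b":
        return mask
    # dtype.kind: 'i' int, 'u' unsigned int, 'f' float, 'c' complex
    if data.dtype.kind not in "iufc":
        raise TypeError(
            f"data of dtype {data.dtype} needs a boolean mask, "
            f"got {mask.dtype}"
        )
    return mask


# A structured array containing a datetime field.
records = np.zeros(3, dtype=[("x", "f8"), ("t", "M8[s]")])

check_mask(records, np.array([True, False, True]))      # accepted

try:
    check_mask(records, np.array([1, 0, 2]))             # integer mask
except TypeError as err:
    print("rejected:", err)
```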

What I'm getting from all this discussion is that there's not much
consensus on an ancillary or masked datatype.  Someone could select
one of these options by fiat and make a small subset of the community
happy, but if it isn't solving a big problem for a lot of people, it
probably shouldn't be in the core, especially if a general solution
might be possible in the future with a little more thought.

--jh--


