[Numpy-discussion] feedback request: proposal to add masks to the core ndarray

Sat Jun 25 16:05:01 EDT 2011

On Sat, Jun 25, 2011 at 6:17 AM, Matthew Brett <matthew.brett at gmail.com> wrote:

> Hi,
>
> On Sat, Jun 25, 2011 at 2:10 AM, Mark Wiebe <mwwiebe at gmail.com> wrote:
> > On Fri, Jun 24, 2011 at 7:02 PM, Matthew Brett <matthew.brett at gmail.com>
> > wrote:
> >>
> >> Hi,
> >>
> >> On Sat, Jun 25, 2011 at 12:22 AM, Wes McKinney <wesmckinn at gmail.com>
> >> wrote:
> >> ...
> >> > Perhaps we should make a wiki page someplace summarizing pros and cons
> >> > of the various implementation approaches?
> >>
> >> But - we should do this if it really is an open question which one we
> >> go for. If not, then we're just slowing Mark down in getting to the
> >> implementation.
> >>
> >> Assuming the question is still open, here's a starter for the pros and
> >> cons:
> >>
> >> array.mask
> >> 1) It's easier / neater to implement
> >
> > Yes
> >
> >>
> >> 2) It can generalize across dtypes
> >
> > Yes
> >
> >>
> >> 3) You can still get the masked data underneath the mask (allowing you
> >> to unmask etc)
> >
> > By setting up views appropriately, yes. If you don't have another view to
> > the underlying data, you can't get at it.
> >>
> >> nafloat64:
> >> 1) No memory overhead
> >
> > Yes
> >
> >>
> >> 2) Battle-tested implementation already done in R
> >
> > We can't really use that though, R is GPL and NumPy is BSD. The low-level
> > implementation details are likely different enough that a re-implementation
> > would be needed anyway.
>
> Right - I wasn't suggesting using the code, only that the idea can be
> made to work coherently with an API that seems to have won friends
> over time.
>

OK, so I think you mean a battle-tested implementation of the interface R
exposes. That interface can be implemented with either masks or NA bit
patterns; I don't believe anything in it is inherently specific to bit
patterns.
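
To illustrate the point, here is a minimal sketch in which the same
"sum, skipping missing values" operation is backed by either storage scheme.
The function names are hypothetical, and a plain NaN stands in for a
dedicated NA bit pattern; neither is part of the proposal.

```python
import numpy as np

# Mask-backed storage: the data keeps its value, a separate array marks it missing.
data = np.array([1.0, 99.0, 2.0, 4.0])
mask = np.array([False, True, False, False])  # True means "missing" here

def sum_skipna_masked(data, mask):
    """Sum the elements that are not flagged as missing (hypothetical helper)."""
    return data[~mask].sum()

# Bit-pattern-backed storage: the missing element is overwritten in place.
# Assumption: NaN plays the role of the NA bit pattern.
na_data = data.copy()
na_data[mask] = np.nan

def sum_skipna_bitpattern(a):
    """Sum the elements that do not hold the NA stand-in (hypothetical helper)."""
    return a[~np.isnan(a)].sum()

masked_total = sum_skipna_masked(data, mask)
bitpattern_total = sum_skipna_bitpattern(na_data)
```

Both calls return the same total, so a caller of this interface could not
tell which storage scheme sits underneath.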

>
> >> I guess we'd have to test directly whether the non-contiguous memory
> >> of the mask and data would cause enough cache-miss problems to
> >> outweigh the potential cycle-savings from single byte comparisons in
> >> array.mask.
> >
> > The different memory buffers are each contiguous, so the access patterns
> > still have a lot of coherency. I intend to give the mask memory layouts
> > matching those of the arrays.
> >>
> >> I guess that one and only one of these will get written.  I guess that
> >> one of these choices may be a lot more satisfying to the current and
> >> future masked array itch than the other.
> >
> > I'm only going to implement one solution, yes.
> >>
> >> I'm personally worried that the memory overhead of array.masks will
> >> make many of us tend to avoid them.  I work with images that can
> >> easily get large enough that I would not want an array-items size byte
> >> array added to my storage.
> >
> > May I ask what kind of dtypes and sizes you're working with?
>
> dtypes for images usually end up as floats - float32 or float64.  On
> disk, and when memory mapped, they are often int16 or uint16. Sizes
> vary from fairly small 3D images of say 64 x 64 x 32 (1M in float64)
> to rather large 4D images - say 256 x 256 x 50 x 500 at the very high
> end (12.5G in float64).
>

OK, so with one mask byte per element, the mask would be an extra 128KB or
1.6G, respectively.
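
The arithmetic behind those figures can be checked with a short sketch. It
assumes one mask byte per array element (a uint8 mask); the helper name is
hypothetical.

```python
import math
import numpy as np

def array_and_mask_bytes(shape, dtype, mask_dtype=np.uint8):
    """Return (data_bytes, mask_bytes) for an array with a same-shape mask.

    Assumption: the mask stores one item of mask_dtype per array element.
    """
    n_items = math.prod(shape)
    data_bytes = n_items * np.dtype(dtype).itemsize
    mask_bytes = n_items * np.dtype(mask_dtype).itemsize
    return data_bytes, mask_bytes

# The two image sizes mentioned above, both held as float64.
small_data, small_mask = array_and_mask_bytes((64, 64, 32), np.float64)
large_data, large_mask = array_and_mask_bytes((256, 256, 50, 500), np.float64)

print(small_data, small_mask)   # bytes for the small 3D image and its mask
print(large_data, large_mask)   # bytes for the large 4D image and its mask
```

For float64 data the mask adds one eighth of the data size; for float32 it
would add one quarter.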

> >> The reason I'm asking for more details about the implementation is
> >> because that is most of the argument for array.mask at the moment (1
> >> and 2 above).
> >
> > I'm first trying to nail down more of the higher level requirements before
> > digging really deep into the implementation details. They greatly affect how
> > those details have to turn out.
>
> Once you've started with the array.mask framework, you've committed
> yourself to the memory hit, and you may lose potential users who often
> hit memory limits.  My guess is that no-one currently using np.ma is
> in that category, because it also uses a separate mask array, as I
> understand it.
>

In the same way, if I start with the NA bit pattern framework, I've
committed to throwing away the underlying values, and I will lose potential
users who want to keep them. This tradeoff goes both ways; it looks like
nobody would be completely satisfied with only one of the two approaches.
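
A minimal sketch of the two sides of that tradeoff, assuming a boolean mask
array for one approach and a plain NaN as a stand-in for an NA bit pattern in
the other (both are illustrative choices, not the proposed implementation):

```python
import numpy as np

original = np.array([1.0, 99.0, 2.0, 4.0])

# Mask approach: flag element 1 as missing, then clear the flag again.
masked_data = original.copy()
mask = np.zeros(masked_data.shape, dtype=bool)
mask[1] = True    # element 1 is now treated as missing
mask[1] = False   # unmasking: the stored value was never touched
recovered = masked_data[1]

# Bit-pattern approach: marking element 1 as missing overwrites its value.
na_data = original.copy()
na_data[1] = np.nan  # assumption: NaN stands in for the NA bit pattern
value_lost = bool(np.isnan(na_data[1]))

# The cost of keeping the value: one extra byte per element for the mask.
mask_extra_bytes = mask.nbytes
```

The mask keeps the original value recoverable at the price of extra storage,
while the in-place marker costs no memory but discards the value.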

-Mark

>
> See you,
>
> Matthew