[Numpy-discussion] feedback request: proposal to add masks to the core ndarray

Sat Jun 25 07:17:53 EDT 2011

Hi,

On Sat, Jun 25, 2011 at 2:10 AM, Mark Wiebe <mwwiebe at gmail.com> wrote:
> On Fri, Jun 24, 2011 at 7:02 PM, Matthew Brett <matthew.brett at gmail.com>
> wrote:
>>
>> Hi,
>>
>> On Sat, Jun 25, 2011 at 12:22 AM, Wes McKinney <wesmckinn at gmail.com>
>> wrote:
>> ...
>> > Perhaps we should make a wiki page someplace summarizing pros and cons
>> > of the various implementation approaches?
>>
>> But - we should do this if it really is an open question which one we
>> go for.  If not, then we're just slowing Mark down in getting to the
>> implementation.
>>
>> Assuming the question is still open, here's a starter for the pros and
>> cons:
>>
>> array.mask
>> 1) It's easier / neater to implement
>
> Yes
>
>>
>> 2) It can generalize across dtypes
>
> Yes
>
>>
>> 3) You can still get the masked data underneath the mask (allowing you
>> to unmask etc)
>
> By setting up views appropriately, yes. If you don't have another view to
> the underlying data, you can't get at it.
>>
>> nafloat64:
>> 1) No memory overhead
>
> Yes
>
>>
>> 2) Battle-tested implementation already done in R
>
> We can't really use that though; R is GPL and NumPy is BSD. The low-level
> implementation details are likely different enough that a re-implementation
> would be needed anyway.

Right - I wasn't suggesting using the code, only that the idea can be
made to work coherently with an API that seems to have won friends
over time.
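
To make the in-band idea concrete, here is a minimal sketch that uses
plain NaN as a stand-in for a dedicated NA value.  This is only an
analogy and an assumption on my part: a real nafloat64 would need its
own bit pattern so that NA stays distinct from an ordinary NaN, which
this sketch does not attempt.  The name NA below is hypothetical.

```python
import numpy as np

# Hypothetical stand-in: plain NaN plays the role of NA here.
# A real nafloat64 would need a dedicated bit pattern instead.
NA = np.nan

values = np.array([1.0, NA, 3.0])

# Missing values propagate through ordinary reductions ...
total_propagating = values.sum()

# ... unless the caller explicitly asks to skip them.
total_skipping = np.nansum(values)

# The "mask" is recovered from the data itself, so no extra
# storage is needed beyond the 3 * 8 bytes of the float64 array.
is_missing = np.isnan(values)
print(total_propagating, total_skipping, is_missing, values.nbytes)
```

The point of the sketch is only that the missing-value flag lives
inside the data buffer, so there is nothing extra to allocate.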

>> I guess we'd have to test directly whether the non-contiguous memory
>> of the mask and data would cause enough cache-miss problems to
>> outweigh the potential cycle-savings from single byte comparisons in
>> array.mask.
>
> The different memory buffers are each contiguous, so the access patterns
> still have a lot of coherency. I intend to give the mask memory layouts
> matching those of the arrays.
>>
>> I guess that one and only one of these will get written.  I guess that
>> one of these choices may be a lot more satisfying to the current and
>> future masked array itch than the other.
>
> I'm only going to implement one solution, yes.
>>
>> I'm personally worried that the memory overhead of array.masks will
>> make many of us tend to avoid them.  I work with images that can
>> easily get large enough that I would not want a byte array, one byte
>> per array item, added to my storage.
>
> May I ask what kind of dtypes and sizes you're working with?

dtypes for images usually end up as floats - float32 or float64.  On
disk, and when memory mapped, they are often int16 or uint16.   Sizes
vary from fairly small 3D images of say 64 x 64 x 32 (1M in float64)
to rather large 4D images - say 256 x 256 x 50 x 500 at the very high
end (about 12.2G in float64).
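
As a rough check on those sizes, the sketch below works out the data
footprint and the extra cost of a mask, assuming one mask byte per
array item (the helper name footprint and its parameters are made up
for illustration).  With that assumption the relative overhead is
1/8 for float64 and 1/2 for int16.

```python
import math

import numpy as np


def footprint(shape, dtype, mask_bytes_per_item=1):
    """Return (data_bytes, mask_bytes) for an array of this shape.

    Assumes a separate mask with a fixed number of bytes per item;
    one byte per item is an assumption, not a confirmed design.
    """
    n_items = math.prod(shape)
    data_bytes = n_items * np.dtype(dtype).itemsize
    mask_bytes = n_items * mask_bytes_per_item
    return data_bytes, mask_bytes


small = footprint((64, 64, 32), np.float64)
large = footprint((256, 256, 50, 500), np.float64)
large_int16 = footprint((256, 256, 50, 500), np.int16)

cases = [("small f8", small), ("large f8", large), ("large i2", large_int16)]
for name, (data_bytes, mask_bytes) in cases:
    print(
        f"{name}: data {data_bytes / 2**30:.3f} GiB, "
        f"mask {mask_bytes / 2**30:.3f} GiB, "
        f"overhead {mask_bytes / data_bytes:.1%}"
    )
```

For the large 4D case that is roughly 1.5 GiB of mask on top of the
float64 data, and the same 1.5 GiB on top of a much smaller int16
memory map.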

>> The reason I'm asking for more details about the implementation is
>> because that is most of the argument for array.mask at the moment (1
>> and 2 above).
>
> I'm first trying to nail down more of the higher level requirements before
> digging really deep into the implementation details. They greatly affect how
> those details have to turn out.

Once you've started with the array.mask framework, you've committed
yourself to the memory hit, and you may lose potential users who often
hit memory limits.  My guess is that no-one currently using np.ma is
in that category, because it also uses a separate mask array, as I
understand it.
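
For reference, this is a small sketch of what the separate mask looks
like in the existing np.ma.  It only illustrates np.ma as it stands
and says nothing about how the proposed array.mask would be laid out.

```python
import numpy as np

data = np.arange(6, dtype=np.float64)
masked = np.ma.masked_array(data, mask=[0, 1, 0, 0, 1, 0])

# np.ma keeps the mask as its own boolean array, one byte per item,
# alongside the data buffer.
mask = np.ma.getmaskarray(masked)
data_bytes = masked.data.nbytes
mask_bytes = mask.nbytes

# Reductions skip the masked items (here 1.0 and 4.0) ...
total = masked.sum()

# ... but the values under the mask are still stored and reachable.
hidden_value = masked.data[1]
print(data_bytes, mask_bytes, total, hidden_value)
```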

See you,

Matthew