[Numpy-discussion] feedback request: proposal to add masks to the core ndarray

Sat Jun 25 11:16:50 EDT 2011

Hi,

On Sat, Jun 25, 2011 at 3:27 PM, Charles R Harris
<charlesr.harris at gmail.com> wrote:
>
>
> On Sat, Jun 25, 2011 at 6:00 AM, Gael Varoquaux
> <gael.varoquaux at normalesup.org> wrote:
>>
>> On Sat, Jun 25, 2011 at 01:02:07AM +0100, Matthew Brett wrote:
>> > I'm personally worried that the memory overhead of array.masks will
>> > make many of us tend to avoid them.  I work with images that can
>> > easily get large enough that I would not want an array-items size byte
>> > array added to my storage.
>>
>> I work with the same kind of data ( :D ).
>>
>> In my lab, we represent masked 3D data as a 1D array of the in-mask
>> values plus a corresponding 3D mask. This reduces the memory footprint,
>> at the cost of losing the 3D shape of the data.
>>
>> I am raising this because it is an option that has not been suggested so
>> far. I am not saying that it should be the option used to implement mask
>> arrays, I am just saying that there are many different ways of doing it,
>> and I don't think that there is a one-size-fits-all solution.
>>
>> I tend to feel like Wes: good building blocks to easily implement our
>> own solutions are what we want.
>>
>
> Could you expand a bit on what sort of data you have and how you deal with
> it. Where does it come from, how is it stored on disk, what do you do with
> it? That sort of thing.

Gael and I groan with the same groan on this one.

Our input data are typically 4D and 3D medical images from MRI scanners.

The 4D images are usually whole brain scans (3D) collected every few
seconds and then concatenated to form the fourth dimension.

The data formats are typically very simple: binary data stored as C
floats, integers and so on, with some associated metadata such as the
image shape, data type and, sometimes, scaling factors to be applied to
the integers on disk in order to get the desired output values.

The formats generally allow the images to be compressed, so that a
compressed file is still a valid image without first being decompressed
on the filesystem.

One major package (in MATLAB) insists that these images are not
compressed so that the package can memory map the arrays on disk, and,
when reading the data via a custom image object, apply the
scale factors to the integer image values on the fly.
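
A rough NumPy sketch of that memory-map-and-scale pattern follows. The
ScaledImage class, the raw int16 file and the slope/intercept names are
all hypothetical, made up for illustration; this is not the MATLAB
package's implementation or any real image format reader.

```python
import os
import tempfile

import numpy as np


class ScaledImage:
    """Hypothetical wrapper: integers stay on disk, scaling happens on read."""

    def __init__(self, path, shape, dtype, slope=1.0, intercept=0.0):
        # np.memmap maps the file without reading it all into memory.
        self._raw = np.memmap(path, dtype=dtype, mode="r", shape=shape)
        self.slope = slope
        self.intercept = intercept

    def __getitem__(self, index):
        # Only the requested part is read from disk and converted to float.
        values = np.asarray(self._raw[index], dtype=np.float32)
        return values * self.slope + self.intercept


# Write a small uncompressed int16 volume to stand in for an image file.
shape = (4, 5, 6)
raw = np.arange(np.prod(shape), dtype=np.int16).reshape(shape)
path = os.path.join(tempfile.mkdtemp(), "volume.raw")
raw.tofile(path)

img = ScaledImage(path, shape, np.int16, slope=0.5, intercept=10.0)
first_slice = img[0]  # scaled floats for one 2D slice only
```

The point of the design is that the float version of the whole image
never has to exist in memory; only the slices actually requested are
scaled.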

As Gael says, we often find that only - say - 50% of the image is of
interest - inside the brain.   In that case it's common to signal
pixels as not being of interest, using NaN values.   This has the
advantage that we can save the masked images to the simple image
formats by using a datatype that supports NaN.
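
As a minimal sketch of the NaN convention, assuming a random float32
volume and an arbitrary box-shaped "brain" mask (both invented for the
example):

```python
import numpy as np

rng = np.random.default_rng(0)
volume = rng.normal(size=(4, 5, 6)).astype(np.float32)

# Hypothetical brain mask: True inside the brain, False outside.
brain = np.zeros(volume.shape, dtype=bool)
brain[1:3, 1:4, 2:5] = True

# Mark everything outside the brain as NaN; the array stays float32.
masked = volume.copy()
masked[~brain] = np.nan

# NaN-aware reductions ignore the out-of-brain values.
mean_inside = np.nanmean(masked)

# The mask can be recovered from the saved data alone.
recovered_mask = ~np.isnan(masked)
```

Note that the masked array is exactly as large as the original, so this
convention helps with file formats but not with memory.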

Gael is pointing out a new masked array concept that uses less memory
than the original.  It's easy to imagine something like his suggestion
being implemented with Travis' deferred array proposal - I remember
getting quite excited about that when I saw it.
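
A minimal sketch of the 1D-array-plus-3D-mask layout in plain NumPy,
assuming an invented volume and box-shaped mask. It illustrates the idea
only; it is not Gael's lab code and has nothing to do with how the
deferred array proposal would work.

```python
import numpy as np

rng = np.random.default_rng(0)
volume = rng.normal(size=(4, 5, 6)).astype(np.float32)

# Hypothetical mask: True for the voxels of interest.
mask = np.zeros(volume.shape, dtype=bool)
mask[1:3, 1:4, 2:5] = True

# Boolean indexing keeps only the in-mask voxels, as a 1D array in C order.
compact = volume[mask]

# Restoring the 3D shape needs the mask; voxels outside it are set to NaN.
restored = np.full(mask.shape, np.nan, dtype=compact.dtype)
restored[mask] = compact

# Storage in bytes: full volume vs. compact values plus the boolean mask.
full_bytes = volume.nbytes
compact_bytes = compact.nbytes + mask.nbytes
```

The boolean mask still costs one byte per voxel here; np.packbits could
shrink it to one bit per voxel, and a single 3D mask can be shared across
every time point of a 4D image.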

In practice, working with these images often involves optimizing for
memory in particular, because they are often too large to fit
conveniently in memory, or to keep more than a few in memory at once.

Cheers,

Matthew