[Neuroimaging] Nibabel API change - always read as float

Mon Jul 6 17:32:00 CEST 2015

Hi,

I wanted to ask y'all about an API change that I want to make to nibabel.

In summary, I want to default to returning floating point arrays from
nibabel images.

Problem - different returned data types from img.get_data()
-------------------------------------------------------------------------------

At the moment, if you do this:

import nibabel as nib

img = nib.load('my_image.nii')
data = img.get_data()

Then the data type (dtype) of the returned data array depends on the
values in the header of `my_image.nii`.  Specifically, if the raw
on-disk data type is `np.int16` (it often is) and the header
scalefactor values are the defaults (1 for slope, 0 for intercept), then
you will get back an array of the on-disk data type, here `np.int16`.
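
To make that rule concrete, here is a minimal sketch of it in plain
numpy.  `scaled_data` is a hypothetical stand-in I made up for
illustration; it is not nibabel's actual loading code, only a model of
the behavior described above.

```python
import numpy as np


def scaled_data(raw, slope=1.0, inter=0.0):
    """Hypothetical model of applying header scalefactors to raw data."""
    if slope == 1.0 and inter == 0.0:
        # Default scalefactors: the on-disk dtype comes back unchanged.
        return raw
    # Non-default scalefactors: scaling is done in floating point.
    return raw.astype(np.float64) * slope + inter


raw = np.array([100, 200, 300], dtype=np.int16)
default_scaled = scaled_data(raw)               # dtype stays int16
custom_scaled = scaled_data(raw, slope=2.0)     # dtype becomes float64
print(default_scaled.dtype, custom_scaled.dtype)
```

So the same calling code gets a different dtype depending only on what
is in the file header.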

This is very efficient on memory, but it's a real trap unless you are careful.

For example, let's say you had a pipeline where you did this:

sum = img.get_data().sum()

That would work fine most of the time, when the data on disk is
floating point, or the scalefactors are not default (1, 0).   Then one
day, you get an image with int16 data type on disk and 1, 0
scalefactors, and your `sum` calculation silently overflows.    I ran
into this when teaching - I had to cast some image arrays to floating
point to get sensible answers.
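
Here is a small numpy-only sketch of the kind of silent wraparound I
mean.  The values are made up for illustration.  I use elementwise
addition here because numpy's `sum` can accumulate small integer types
in a wider platform integer, so whether a particular `sum` overflows
depends on the platform and the size of the data.

```python
import numpy as np

# Made-up values near the top of the int16 range (maximum is 32767).
data = np.array([30000, 30000], dtype=np.int16)

# Arithmetic on int16 arrays stays int16 and wraps without any error.
doubled = data + data

# Casting to floating point first gives the expected answer.
doubled_float = data.astype(np.float64) + data.astype(np.float64)

print(doubled.dtype, doubled)
print(doubled_float.dtype, doubled_float)
```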

Solution
-----------

I think that the default behavior of nibabel should be to do the thing
least likely to trip you up by accident, so - I think in due course,
nibabel should always return a floating point array from `get_data()`
by default.

I propose to add a keyword-only argument to `get_data()` - `to_float`, as in:

data = img.get_data(to_float=False)  # The current default behavior
data = img.get_data(to_float=True)  # Integer arrays automatically cast to float64

For this cycle (the nibabel 2.0 series), I propose to raise a warning
if you don't pass in an explicit True or False, warning that the
default behavior for nibabel 3.0 will change from `to_float=False` to
`to_float=True`.
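
As a rough sketch of how the transition could behave, assuming a
free-standing function rather than the real method: `get_data_sketch`
and its warning text are hypothetical, and the choice of
`FutureWarning` is my assumption, not a settled part of the proposal.

```python
import warnings

import numpy as np


def get_data_sketch(raw, *, to_float=None):
    """Hypothetical illustration of the proposed `to_float` keyword."""
    if to_float is None:
        # No explicit choice: warn about the planned change of default,
        # then keep the current behavior.
        warnings.warn(
            "The default will change from to_float=False to "
            "to_float=True; pass to_float explicitly",
            FutureWarning,
            stacklevel=2,
        )
        to_float = False
    if to_float and not np.issubdtype(raw.dtype, np.floating):
        # Cast integer arrays to float64; leave float arrays alone.
        return raw.astype(np.float64)
    return raw


raw = np.array([1, 2, 3], dtype=np.int16)
as_is = get_data_sketch(raw, to_float=False)     # int16
as_float = get_data_sketch(raw, to_float=True)   # float64
print(as_is.dtype, as_float.dtype)
```

Passing `to_float` explicitly in either direction silences the warning,
so existing code can opt in or out before the default changes.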

The other, more fancy ways of getting the image data would continue as
they are, such as:

data = np.array(img.dataobj)
data = img.dataobj[:]

These will both return ints or floats depending on the raw data dtype
and the scalefactors.  This is on the basis that people using these
will be more advanced, and therefore more likely to want memory
efficiency at the expense of having to be careful about the returned
data dtype.

Does this seem reasonable to y'all?    Thoughts, suggestions?

Cheers,

Matthew