Hopefully I can expand a bit on the image range scaling point, with some historical perspective on image processing data handling and how we got to where we are today.

Long before scikit-image existed, image processing was often taught via Matlab with the appropriate toolbox.  Curricula generally opened with a discussion of datatypes, since this was central to interacting with the underlying image data.  A very old function in the Mathworks toolbox, im2double, takes an integer array and hands you back a double array.  However, it is not a simple dtype conversion; it rescales the input image from its original range to [0, 1].  I believe the original reasoning was that certain image operations, e.g., exposure adjustments and inverting the image, are simpler and more intuitive to consider and teach with a unit range.
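As a rough illustration of why a unit range is convenient, here is a minimal NumPy sketch of an im2double-style conversion.  It assumes an unsigned integer input, and `to_unit_range` is a hypothetical helper written for this message, not a Matlab or scikit-image function.

```python
import numpy as np


def to_unit_range(image):
    """Rescale an unsigned integer image to floats in [0, 1].

    Hypothetical helper: divides by the largest value the input dtype can hold.
    """
    if not np.issubdtype(image.dtype, np.unsignedinteger):
        raise TypeError("this sketch only handles unsigned integer images")
    return image.astype(np.float64) / np.iinfo(image.dtype).max


img = np.array([[0, 51], [204, 255]], dtype=np.uint8)

unit = to_unit_range(img)   # float64 values spanning 0.0 to 1.0
inverted = 1.0 - unit       # inversion no longer depends on the original dtype
```

On the integer image, the same inversion would have to be written as `255 - img`, with a different constant for every dtype.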

Scikit-Image has taken the position that we accept "[nearly] anything" as an input dtype, but we do not guarantee the output dtype will match the input.  When a function returns a modified image (exposure-adjusted, transformed, denoised, etc.), integer images are generally accepted but floating point images are returned, to avoid the precision loss inherent in integer pipelines.  We check the input datatype to see if it needs to be changed for safety, and if necessary we do so.

Datatype conversion currently takes an input integer image and rescales it to [0, 1] - Matlab style - by default.  In many cases the rescaling step is not optional.  Rescaling to [0, 1] is expected by Matlab veterans but perennially confuses new users.  More concerning, some image data has physical meaning (e.g., CT Hounsfield units) and for obvious reasons, such users want to opt out of this behavior.  We've made it possible to turn off rescaling, and some functions now expose a `preserve_range=` kwarg, but the current default behavior remains to silently rescale for backwards compatibility.
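To make the concern about physical units concrete, the sketch below contrasts a dtype-based rescale with a plain float conversion.  `convert_to_float` and its `preserve_range` flag are hypothetical stand-ins written for this example.  The rescaling formula (divide by the dtype maximum) is a simplification and is not claimed to match scikit-image's exact arithmetic; for signed input it yields values roughly in [-1, 1].

```python
import numpy as np


def convert_to_float(image, preserve_range=False):
    """Convert to float64, optionally rescaling by the input dtype's maximum.

    Hypothetical helper; the rescale formula is an assumption for illustration.
    """
    image = np.asarray(image)
    out = image.astype(np.float64)
    if preserve_range or not np.issubdtype(image.dtype, np.integer):
        return out
    return out / np.iinfo(image.dtype).max


# Hounsfield units: air is -1000 and water is 0; 400 stands for a denser material.
hu = np.array([-1000, 0, 400], dtype=np.int16)

rescaled = convert_to_float(hu)                        # units are no longer HU
preserved = convert_to_float(hu, preserve_range=True)  # still -1000.0, 0.0, 400.0
```

After the rescale, a threshold such as "everything above 0 HU" can only be expressed by first undoing the dtype-dependent scaling.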

We propose to globally remove the Matlab-style forced rescaling.  Our functions would no longer assume a unit range, and would instead respect the input data range even if conversion to float is required for safety.

Many if not most users may not even notice this change.  In the worst case it is a linear multiplicative scaling.  For those who do, it is easy to retain the prior behavior by normalizing their images in preprocessing or using `img_as_float()` with an optional kwarg to enable legacy unit normalization.
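The "linear multiplicative scaling" point can be sketched with a toy linear operation.  Here `box_blur` is a hypothetical stand-in for any linear image filter, and the normalization is done by the user in preprocessing rather than by the library.

```python
import numpy as np


def box_blur(signal):
    """3-tap mean filter; a stand-in for any linear image operation."""
    kernel = np.ones(3) / 3.0
    return np.convolve(signal, kernel, mode="same")


row = np.array([10, 20, 30, 200, 30, 20, 10], dtype=np.uint8)
scale = float(np.iinfo(row.dtype).max)         # 255.0 for uint8

preserved = box_blur(row.astype(np.float64))   # range-preserving path
legacy = box_blur(row / scale)                 # user normalizes first, legacy style

# For a linear operation the two results differ only by the constant factor.
same_up_to_scale = np.allclose(legacy * scale, preserved)
```

Nonlinear operations, or parameters expressed in intensity units (thresholds, noise levels), are where the difference would actually be visible to users.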

Put differently, the current state is as if Scikit-Learn automatically whitened all input data and you couldn't turn this off even if you wanted to.  Instead, Scikit-Learn strongly recommends whitening but ultimately the user is responsible for their data.  We want Scikit-Image to move to a similar model, where we do not impose Matlab-style rescaling on our users' data.


On Thu, Mar 5, 2020 at 5:51 AM Ralf Gommers <ralf.gommers@gmail.com> wrote:

On Thu, Mar 5, 2020 at 1:59 AM Stefan van der Walt <stefanv@berkeley.edu> wrote:
On Wed, Mar 4, 2020, at 15:01, Juan Nunez-Iglesias wrote:
> I might not be getting the full picture here: returning Bunches instead of (in most places) NumPy arrays would in itself be a breaking change? Similarly, one of the big changes we are proposing is returning an array of floats that is *not* rescaled to [0, 1]. That is, we want to still return a NumPy array (whether that is the plain array or an attribute in the Bunch) but the values in the array will be different. I don’t clearly see how Bunch solves that problem?

It doesn't indeed, nothing can solve that for you.

There are at least two considerations here:

- We want to stop coercing image data to a certain range based on the data type of the input. This will be a breaking change, overall, unless we introduce the `preserve_range` keyword widely.

I did not get that from "we want to change the return value of a function". I assumed it was adding new return values.

It seems like a *really* unhealthy idea to me to silently change numerical values. Despite the extensive communication, the vast majority of your users will not be aware of what's happening and run the risk of silently getting invalid results. I can't think of any important package in the SciPy/PyData ecosystem that has ever done what you're proposing after becoming popular. I would recommend to change the package name or name of the main namespace, to make sure people see an exception and become aware of the problem when they upgrade.

- We would like to make, consistently, most functions be of the form:

 output_image = function(input_image)

This makes for easy construction of pipelines:

 output_image = first_func(second_func(third_func(input_image)))

In cases where additional calculations are made, we want signatures of the form:

 output_image, additional_namedtuple = function.extra(input_image)

[Exactly how the calculation of additional_namedtuple is triggered is not set in stone; the `.extra` attribute on functions was one suggestion of how to do that easily.]

The usage of named tuples / bunches / data objects will be an integral part of the design.
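One way the two signatures above could coexist is sketched here.  Every name in it (`threshold`, `invert`, `ThresholdExtra`, and the way `.extra` is attached) is hypothetical, since the mechanism is explicitly not settled.

```python
from collections import namedtuple

import numpy as np

# Hypothetical container for the additional results.
ThresholdExtra = namedtuple("ThresholdExtra", ["threshold", "n_foreground"])


def _threshold_with_extra(image):
    """Return the output image plus a namedtuple of additional results."""
    t = float(image.mean())
    mask = image > t
    return mask, ThresholdExtra(threshold=t, n_foreground=int(mask.sum()))


def threshold(image):
    """Plain form: output_image = function(input_image)."""
    return _threshold_with_extra(image)[0]


# Expose the richer variant as an attribute on the plain function.
threshold.extra = _threshold_with_extra


def invert(image):
    """Another image-in, image-out function, so the two can be chained."""
    return image.max() - image


img = np.array([[1.0, 2.0], [3.0, 10.0]])

out = threshold(invert(img))                # simple pipeline
out2, info = threshold.extra(invert(img))   # same output, plus extras
```

Because the plain call always returns only an image, pipelines stay composable, and the additional values are available on request without changing the default return type.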

Thanks. Those changes all sound really useful.


Best regards,
scikit-image mailing list -- scikit-image@python.org
To unsubscribe send an email to scikit-image-leave@python.org