Hi,

I'm one of the maintainers of a project that has some overlap with scikit-image. Recently, I've been working on optimising some of our slower code paths, and I came across an interesting result. In our package, we order our image data in the same way as in scikit-image:

image.shape = [height, width, channels]

Given that cPython runs on top of C, this means that the normal memory layout is generally row-major. Therefore, in the C-contiguous layout of the above memory pattern, it means that the rightmost channel indexes fastest. For example, when iterating over a vectorised image (stolen shamelessly from Wikipedia):

int A[2, 3, 4] = np.array([[[1,2,3,4], [5,6,7,8], [9,10,11,12]], [[13,14,15,16}, [17,18,19,20], [21,22,23,24]])

We walk that memory:

1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24

Practically, this means that we can most efficiently walk along the channels. 

However! In my experience, especially with 2D images, you usually store most of your relationships across the image and leave the channels as independent of one another. A simple example of this, and the one I was considering, is calculating the gradient of a multi-channel image (so a 3 channel RGB image will return a 6-channel image that represents the x and y gradient over each channel R, G and B).

Now, by calculating the gradient on a [height, width, channels] image, I have to jump a lot further in memory to find my neighbours, because I'm also skipping over channels. This is actually very inefficient, and in my testing (implementing the gradient method in cython) this actually causes a huge bottleneck. For example, the more channels you have, the closer in performance my code comes to Numpy's pure Python gradient implementation. This is because both pieces of code are being hurt by having to work 'against' the memory layout. 

A practical example of this can be seen here: http://nbviewer.ipython.org/gist/patricksnape/7b31c14c7103c2995956

As mentioned, I have some code I've been working on that implements the gradient directly in Cython and uses OpenMP to parallelize the gradient over the channels, which is significantly faster for the channels first case.

So, my question is: why did you choose to place the channels on the last axis rather than the first?