int A[2, 3, 4] = np.array([[[1,2,3,4], [5,6,7,8], [9,10,11,12]], [[13,14,15,16}, [17,18,19,20], [21,22,23,24]])
We walk that memory:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
Practically, this means that we can most efficiently walk along the channels.
However! In my experience, especially with 2D images, you usually store most of your relationships across the image and leave the channels as independent of one another. A simple example of this, and the one I was considering, is calculating the gradient of a multi-channel image (so a 3 channel RGB image will return a 6-channel image that represents the x and y gradient over each channel R, G and B).
Now, by calculating the gradient on a [height, width, channels] image, I have to jump a lot further in memory to find my neighbours, because I'm also skipping over channels. This is actually very inefficient, and in my testing (implementing the gradient method in cython) this actually causes a huge bottleneck. For example, the more channels you have, the closer in performance my code comes to Numpy's pure Python gradient implementation. This is because both pieces of code are being hurt by having to work 'against' the memory layout.
A practical example of this can be seen here: http://nbviewer.ipython.org/gist/patricksnape/7b31c14c7103c2995956
As mentioned, I have some code I've been working on that implements the gradient directly in Cython and uses OpenMP to parallelize the gradient over the channels, which is significantly faster for the channels first case.
So, my question is: why did you choose to place the channels on the last axis rather than the first?