Re: [Numpy-discussion] Change default order to Fortran order
This is very good discussion. Thank you all for replying.
I can see the fundamental difference is that I always think/talk/read/write about a 3D image as I(x, y, z), not (plane, row, column). I come from MRI (magnetic resonance imaging) research, and I can assure you that the entire MRI community uses (x, y, z): books, journal papers, conference abstracts, presentations, everything. We even talk about what we call "logical x/y/z" and "physical x/y/z", and the rotation matrix that converts between the two coordinate systems. The radiologists are also used to (x, y, z). For example, we always say "my image is 256 by 256 by 20 slices", and we never say "20 by 256 by 256".
So, basically, at least in MRI, we always talk about an image as I(x, y, z), and we always assume that "x" is the fastest changing index. That's why I prefer column-major (because it is more natural).
Of course, I can totally get my work done using row-major; I just have to always remind myself to "write the last dimension index first" when coding. I have actually done this before, and I found it would be so much easier to just use column-major.
Kang
On 08/02/15, Juan Nunez-Iglesias
Hi Kang,
Feel free to come chat about your application on the scikit-image list [1]! I'll note that we've been through the array order discussion many times there and even have a doc page about it [2].
The short version is that you'll save yourself a lot of pain by starting to think of your images as (plane, row, column) instead of (x, y, z). The syntax actually becomes friendlier too. For example, to do something to each slice of data, you do:
for plane in image: plane += foo
instead of
for z in range(image.shape[2]): image[:, :, z] += foo
for example.
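The plane-iteration idiom above can be sketched as a runnable example (the 20-plane volume shape here is made up for illustration):

```python
import numpy as np

# Hypothetical (plane, row, column) volume: 20 planes of 256 x 256.
image = np.zeros((20, 256, 256))

# Iterating an array yields views along the first axis,
# so each plane can be modified in place.
for plane in image:
    plane += 1.0

assert image[0, 0, 0] == 1.0
```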
Juan.
[1] scikit-image@googlegroups.com [2] http://scikit-image.org/docs/dev/user_guide/numpy_images.html#coordinate-con...
PS: As to the renamed Fortran-ordered numpy, may I suggest "funpy". The F is for Fortran and the fun is for all the fun you'll have maintaining it. =P
On Mon, 3 Aug 2015 at 6:28 am Daniel Sank
wrote: Kang,
Thank you for explaining your motivation. It's clear from your last note, as you said, that your desire for column-first indexing has nothing to do with in-memory data layout. That being the case, I strongly urge you to just use bare numpy and do not use the "fortran_zeros" function I recommended before. Changing the in-memory layout via the "order" keyword in numpy.zeros will not change the way indexing works at all. You gain absolutely nothing by changing the in-memory order unless you are writing some C or Fortran code which will interact with the data in memory.
To see what I mean, consider the following examples:
x = np.array([[1, 2, 3], [4, 5, 6]])
x.shape
(2, 3)
and
x = np.array([[1, 2, 3], [4, 5, 6]], order='F')
x.shape
(2, 3)
You see that changing the in-memory order has nothing whatsoever to do with the array's shape or how you access it.
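To make this concrete, here is a small sketch (my own illustration, not from the original mail) showing that order='F' changes only the strides, never the shape or the meaning of the indices:

```python
import numpy as np

c = np.zeros((2, 3), order='C')
f = np.zeros((2, 3), order='F')

# Shape and indexing are identical...
assert c.shape == f.shape == (2, 3)
c[1, 2] = 7
f[1, 2] = 7
assert c[1, 2] == f[1, 2] == 7

# ...only the in-memory strides differ (float64 = 8 bytes per element).
assert c.strides == (24, 8)   # C order: rows are contiguous
assert f.strides == (8, 16)   # Fortran order: columns are contiguous
```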
"You will see run time error. Depending on environment, you may get useful error message (i.e. index out of range), but sometimes you just get bad image results."
Could you give a very simple example of what you mean? I can't think of how this could ever happen and your fear here makes me think there's a fundamental misunderstanding about how array operations in numpy and other programming languages work. As an example, iteration in numpy goes through the first index:
x = np.array([[1, 2, 3], [4, 5, 6]])
for foo in x:
    ...
Inside the for loop, foo takes on the values [1, 2, 3] on the first iteration and [4, 5, 6] on the second. If you want to iterate through the columns just do this instead
x = np.array([[1, 2, 3], [4, 5, 6]])
for foo in x.T:
    ...
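Put together, a runnable version of the row-versus-column iteration comparison (a minimal sketch):

```python
import numpy as np

x = np.array([[1, 2, 3], [4, 5, 6]])

# Iterating x yields rows; iterating x.T yields columns.
rows = [list(r) for r in x]
cols = [list(c) for c in x.T]

assert rows == [[1, 2, 3], [4, 5, 6]]
assert cols == [[1, 4], [2, 5], [3, 6]]
```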
If your complaint is that you want np.array([[1, 2, 3], [4, 5, 6]]) to produce an array with shape (3, 2) then you should own up to the fact that the array constructor expects it the other way around and do this
x = np.array([[1, 2, 3], [4, 5, 6]]).T
instead. This is infinitely better than trying to write a shim function or patch numpy, because with .T you're using (fast) built-in functionality which other people reading your code will understand.
The real message here is that whether the first index runs over rows or columns is actually meaningless. The only places the row-versus-column issue has any meaning are when doing input/output (in which case you should use the transpose if you actually need it) and when doing iteration. One thing that would make sense, if you're reading from a binary file format that uses column-major order, would be to write your own reader function:
def read_fortran_style_binary_file(file): return np.fromfile(file).T
Note that if you do this then you already have a column major array in numpy and you don't have to worry about any other transposes (except, again, when doing more I/O or passing to something like a plotting function).
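A fleshed-out version of such a reader might look like the following sketch. The function name, shape, and dtype parameters are assumptions for illustration; note that np.fromfile returns a flat 1-D array, so an explicit shape and order='F' are needed to recover a column-major volume:

```python
import numpy as np

def read_fortran_volume(path, shape, dtype=np.float32):
    """Read a column-major (Fortran-order) binary volume.

    Hypothetical reader: `path`, `shape`, and `dtype` describe an assumed
    raw file format. np.fromfile returns a flat 1-D array, and
    reshape(order='F') maps it back so that the first index (x) varies
    fastest in memory, as the file was written.
    """
    flat = np.fromfile(path, dtype=dtype)
    return flat.reshape(shape, order='F')
```

After this, indexing works as I[x, y, z] with x fastest changing in memory, and no further transposes are needed until the next I/O boundary.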
On Sun, Aug 2, 2015 at 7:16 PM, Kang Wang
wrote: Thank you all for replying and providing useful insights and suggestions.
The reasons I really want to use column-major are:
- I am an image-oriented user (not matrix-oriented, as explained in http://docs.scipy.org/doc/numpy/reference/internals.html#multidimensional-ar...). I am so used to reading/writing "I(x, y, z)" in textbooks and code that if the environment (a row-major environment) forces me to write I(z, y, x), I will write a bug whenever I am not 100% focused. When this happens, it is difficult to debug, because everything compiles and builds fine. You will see a run-time error. Depending on the environment, you may get a useful error message (e.g. index out of range), but sometimes you just get bad image results.
- It actually has not much to do with the actual data layout in memory. In image processing, especially medical imaging where I work, if you have a 3D image, everyone will agree that in memory the X index is the fastest-changing index, and the Z dimension (we often call it the "slice" dimension) has the largest stride. So, if the data layout is like this in memory, and image-oriented users are used to reading/writing "I(x, y, z)", the only storage order that makes sense is column-major.
- I also write code in MATLAB and C/C++. In MATLAB, a matrix is a column-major array. In C/C++, we often use ITK, which is also column-major (http://www.itk.org/Doxygen/html/classitk_1_1Image.html). I really prefer to always read/write column-major code to minimize coding bugs related to storage order. I also prefer indexing to be 0-based; however, there is nothing I can do about that in MATLAB (which is 1-based).
I can see that my original thought about "modifying NumPy source and re-compiling" is probably a bad idea. The suggestion about using "fortran_zeros = partial(np.zeros, order='F')" is probably the best way so far, in my opinion, and I am going to give it a try.
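For reference, the partial needs to bind the keyword argument rather than call np.zeros immediately; a minimal sketch of the working form:

```python
import numpy as np
from functools import partial

# Bind order='F' without calling np.zeros yet; the result is a callable
# that behaves like np.zeros but returns Fortran-ordered arrays.
fortran_zeros = partial(np.zeros, order='F')

a = fortran_zeros((256, 256, 20))
assert a.shape == (256, 256, 20)
assert a.flags['F_CONTIGUOUS']
assert not a.flags['C_CONTIGUOUS']
```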
Again, thank you all for replying.
Kang
On 08/02/15, Nathaniel Smith
wrote: On Aug 2, 2015 7:30 AM, "Sturla Molden"
wrote: On 02/08/15 15:55, Kang Wang wrote:
Can anyone provide any insight/help?
There is no "default order". There was before, but now all operators control the order of their return arrays from the order of their input array.
This is... overoptimistic. I would not rely on this in code that I wrote.
It's true that many numpy operations do preserve the input order. But there are also many operations that don't. And which are which often changes between releases. (Not on purpose usually, but it's an easy bug to introduce. And sometimes it is done intentionally, e.g. to make functions faster. It sucks to have to make a function slower for everyone because someone somewhere is depending on memory layout default details.) And there are operations where it's not even clear what preserving order means (indexing a C array with a Fortran array, add(C, fortran), ...), and even lots of operations that intrinsically break contiguity/ordering (transpose, diagonal, slicing, swapaxes, ...), so you will end up with mixed orderings one way or another in any non-trivial program.
Instead, it's better to explicitly specify order= just at the places where you care. That way your code is *locally* correct (you can check it will work by just reading the one function). The alternative is to try and enforce a *global* property on your program ("everyone everywhere is very careful to only use contiguity-preserving operations", where "everyone" includes third party libraries like numpy and others). In software design, local invariants are always better than global invariants -- the most well known example is local variables versus global variables, but the principle is much broader.
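A sketch of this "specify order= locally" style (the library boundary here is hypothetical):

```python
import numpy as np

def call_fortran_library(arr):
    """Boundary function for a hypothetical Fortran-only library.

    Rather than keeping every array Fortran-ordered globally, convert
    right where the layout matters. np.asfortranarray is a no-op if the
    array is already Fortran-contiguous, and copies otherwise.
    """
    arr = np.asfortranarray(arr)
    assert arr.flags['F_CONTIGUOUS']
    return arr

x = np.ones((4, 5))            # default C order
y = call_fortran_library(x)
assert y.flags['F_CONTIGUOUS']
assert np.array_equal(x, y)    # values and indexing unchanged
```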
-n
-- Kang Wang, Ph.D. 1111 Highland Ave., Room 1113 Madison, WI 53705-2275 ----------------------------------------
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
-- Daniel Sank
On Aug 2, 2015 11:06 PM, "Kang Wang"
Why not just use I[x, y, z] like you're used to, and let the computer worry about the physical layout in memory? Sometimes this will be Fortran order and sometimes C order and sometimes something else, but you don't have to know or care; 99% of the time it doesn't matter. The worst case is that when you use a python wrapper to call into a library that can only handle Fortran order, the wrapper will quietly have to convert the memory order around, and it will be slightly slower than if you had used Fortran order in the first place. But in practice you'll barely ever notice this, and when you do, *then* you can tell numpy explicitly what memory layout you want in the situation where it matters.

General principle: do what's easiest for the programmer, not what's easiest for the computer.

-n
Hi,
On Mon, Aug 3, 2015 at 8:09 AM, Nathaniel Smith
Yes - if you are using numpy, you really have to look numpy in the eye and say: "I will let you worry about the array element order in memory, and in return, you promise to make indexing work as I would expect."

Just for example, let's say you loaded an MRI image into memory:

In [1]: import nibabel
In [2]: img = nibabel.load('my_mri.nii')
In [3]: data = img.get_data()

Because NIfTI images are Fortran memory layout, this happens to be the memory layout you get for your array:

In [4]: data.flags
Out[4]:
  C_CONTIGUOUS : False
  F_CONTIGUOUS : True
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  UPDATEIFCOPY : False

But now - in Python - all I care about is what data I have on the first, second, and third axes. For example, I could do this:

In [5]: data_copy = data.copy()

This has exactly the same values as the original array, and at the same index positions:

In [7]: import numpy as np
In [8]: np.all(data == data_copy)
Out[8]: memmap(True, dtype=bool)

but I now have a C memory layout array:

In [9]: data_copy.flags
Out[9]:
  C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  UPDATEIFCOPY : False

Worse than that, if I slice my original data array, then I get an array that is neither C- nor Fortran-contiguous in memory:

In [10]: data_view = data[:, :, ::2]
In [11]: data_view.flags
Out[11]:
  C_CONTIGUOUS : False
  F_CONTIGUOUS : False
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  UPDATEIFCOPY : False

So - if you want every array to be Fortran-contiguous in memory, I would not start with numpy at all; I would write your own array library.

The alternative - or "the numpy way" - is to give up on enforcing a particular layout in memory until you need to pass an array to some C or C++ or Fortran code that requires a particular layout, in which case you have your extension code copy the array into the required layout on entry. Of course, this is what numpy itself has to do when interfacing with external libraries like BLAS or LAPACK.

Cheers,

Matthew
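The same flags behaviour can be reproduced without nibabel or an MRI file; a self-contained sketch with a stand-in Fortran-ordered volume:

```python
import numpy as np

# Stand-in for loaded MRI data: a Fortran-ordered 3D array.
data = np.zeros((64, 64, 20), order='F')
assert data.flags['F_CONTIGUOUS']

# .copy() defaults to C order, even though every value sits at the
# same index positions as in the original:
data_copy = data.copy()
assert data_copy.flags['C_CONTIGUOUS']
assert np.array_equal(data, data_copy)

# A strided slice is neither C- nor Fortran-contiguous:
data_view = data[:, :, ::2]
assert not data_view.flags['C_CONTIGUOUS']
assert not data_view.flags['F_CONTIGUOUS']
```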
On Mon Aug 3 10:49:35 2015 GMT+0200, Matthew Brett wrote:
Yeah, I would like to second those arguments. Most of the time, there is no need to worry about the layout. For large chunks you allocate, it may make sense for speed, etc., so you can alias the creation functions. Generally, I would suggest simply not worrying about the memory layout. Also, do not *trust* the layout of most function returns. If you need a specific layout to interface with other code, always check what you got.

-Sebastian
I agree that often you don't need to worry about the memory order. However, it is not uncommon in medical imaging to go back and forth between a 2D or 3D image representation and a 1D array representation (e.g. as often used in image reconstruction algorithms). I found that the main time it was necessary to pay careful attention to the memory layout was when converting Matlab scripts that involve reshaping operations.
Hi,
On Mon, Aug 3, 2015 at 3:13 PM, Gregory Lee
Yes, good point. A typical example would be this kind of thing:

# data is a 4D array with time / volume axis last
data_2d = data.reshape((-1, data.shape[-1]))

For MATLAB, the columns of this array would (by default) have the values on the first axis fastest changing, then the second, then the third, whereas numpy's default is the other way round.

I find I usually don't have to worry about this, because I'm later going to do:

data_processed_4d = data_2d.reshape(data.shape)

which will reverse the previous reshape in the correct way.

But in any case - this is not directly to do with the array memory layout. You will get the same output from reshape whether the memory layout of `data` was Fortran or C.

Cheers,

Matthew
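A self-contained sketch of that reshape round-trip, including the order='F' variant that mirrors MATLAB's column-major flattening (the array sizes here are made up):

```python
import numpy as np

# Hypothetical 4D dataset: a (2, 3, 4) volume with a time axis of length 5.
data = np.arange(2 * 3 * 4 * 5).reshape((2, 3, 4, 5))

# numpy's default (C-order) reshape: last axis varies fastest.
data_2d = data.reshape((-1, data.shape[-1]))
assert data_2d.shape == (24, 5)

# MATLAB-style column-major flattening: pass order='F', so the first
# spatial axis varies fastest within each column.
data_2d_f = data.reshape((-1, data.shape[-1]), order='F')

# Round-tripping with the same order recovers the original either way:
assert np.array_equal(data_2d.reshape(data.shape), data)
assert np.array_equal(data_2d_f.reshape(data.shape, order='F'), data)
```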
On Mon Aug 3 16:26:10 2015 GMT+0200, Matthew Brett wrote:
Just as a remark: reshape has an order parameter too (it controls iteration order, not really memory order), though it may make more copies if the two orders do not match.

- Sebastian
participants (5)
- Gregory Lee
- Kang Wang
- Matthew Brett
- Nathaniel Smith
- Sebastian Berg