Unnecessarily bad performance of elementwise operators with Fortran-arrays

Hi!

I wonder why simple elementwise operations like "a * 2" or "a + 1" are not performed in order of increasing memory addresses in order to exploit CPU caches etc. - as it is now, their speed drops by a factor of around 3 simply by transpose()ing. Similarly (but even less logical), copy() and even the constructor are affected (yes, I understand that copy() creates contiguous arrays, but shouldn't it respect/retain the order nevertheless?):

### constructor ###
In [89]: %timeit -r 10 -n 1000000 numpy.ndarray((3,3,3))
1000000 loops, best of 10: 1.19 µs per loop

In [90]: %timeit -r 10 -n 1000000 numpy.ndarray((3,3,3), order="f")
1000000 loops, best of 10: 2.19 µs per loop

### copy 3x3x3 array ###
In [85]: a = numpy.ndarray((3,3,3))

In [86]: %timeit -r 10 a.copy()
1000000 loops, best of 10: 1.14 µs per loop

In [87]: a = numpy.ndarray((3,3,3), order="f")

In [88]: %timeit -r 10 -n 1000000 a.copy()
1000000 loops, best of 10: 3.39 µs per loop

### copy 256x256x256 array ###
In [74]: a = numpy.ndarray((256,256,256))

In [75]: %timeit -r 10 a.copy()
10 loops, best of 10: 119 ms per loop

In [76]: a = numpy.ndarray((256,256,256), order="f")

In [77]: %timeit -r 10 a.copy()
10 loops, best of 10: 274 ms per loop

### fill ###
In [79]: a = numpy.ndarray((256,256,256))

In [80]: %timeit -r 10 a.fill(0)
10 loops, best of 10: 60.2 ms per loop

In [81]: a = numpy.ndarray((256,256,256), order="f")

In [82]: %timeit -r 10 a.fill(0)
10 loops, best of 10: 60.2 ms per loop

### power ###
In [151]: a = numpy.ndarray((256,256,256))

In [152]: %timeit -r 10 a ** 2
10 loops, best of 10: 124 ms per loop

In [153]: a = numpy.asfortranarray(a)

In [154]: %timeit -r 10 a ** 2
10 loops, best of 10: 458 ms per loop

### addition ###
In [160]: a = numpy.ndarray((256,256,256))

In [161]: %timeit -r 10 a + 1
10 loops, best of 10: 139 ms per loop

In [162]: a = numpy.asfortranarray(a)

In [163]: %timeit -r 10 a + 1
10 loops, best of 10: 465 ms per loop

### fft ###
In [146]: %timeit -r 10 numpy.fft.fft(vol, axis=0)
10 loops, best of 10: 1.16 s per loop

In [148]: %timeit -r 10 numpy.fft.fft(vol0, axis=2)
10 loops, best of 10: 1.16 s per loop

In [149]: vol.flags
Out[149]:
  C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  UPDATEIFCOPY : False

In [150]: vol0.flags
Out[150]:
  C_CONTIGUOUS : False
  F_CONTIGUOUS : True
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  UPDATEIFCOPY : False

In [9]: %timeit -r 10 numpy.fft.fft(vol0, axis=0)
10 loops, best of 10: 939 ms per loop

### mean ###
In [173]: %timeit -r 10 vol.mean()
10 loops, best of 10: 272 ms per loop

In [174]: %timeit -r 10 vol0.mean()
10 loops, best of 10: 683 ms per loop

### max ###
In [175]: %timeit -r 10 vol.max()
10 loops, best of 10: 63.8 ms per loop

In [176]: %timeit -r 10 vol0.max()
10 loops, best of 10: 475 ms per loop

### min ###
In [177]: %timeit -r 10 vol.min()
10 loops, best of 10: 63.8 ms per loop

In [178]: %timeit -r 10 vol0.min()
10 loops, best of 10: 476 ms per loop

### rot90 ###
In [10]: %timeit -r 10 numpy.rot90(vol)
100000 loops, best of 10: 6.97 µs per loop

In [12]: %timeit -r 10 numpy.rot90(vol0)
100000 loops, best of 10: 6.92 µs per loop

--
Ciao, / / /--/ / / ANS
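A minimal, self-contained way to reproduce the core effect (a sketch assuming only numpy and the standard library; absolute timings are machine-dependent):

import timeit
import numpy

a_c = numpy.zeros((256, 256, 256))    # C-contiguous
a_f = numpy.asfortranarray(a_c)       # same values, Fortran layout

for label, arr in (("C order", a_c), ("Fortran order", a_f)):
    best = min(timeit.repeat(lambda: arr + 1, repeat=3, number=10)) / 10
    print("%s: %.0f ms per loop" % (label, best * 1e3))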

Hans Meine wrote:
Hi!
I wonder why simple elementwise operations like "a * 2" or "a + 1" are not performed in order of increasing memory addresses in order to exploit CPU caches etc. - as it is now, their speed drops by a factor of around 3 simply by transpose()ing.
Because it is not trivial to do so in all cases, I guess. It is a problem which comes back from time to time on the ML, but AFAIK, nobody has had a fix for it. Fundamentally, for many element-wise operations, either you have to implement the thing for every possible case, or you abstract it through an iterator, which gives you a decrease in performance in some cases. There are also cases where the current implementation is far from optimal, for lack of man power, I guess (taking a look at PyArray_Mean, for example, shows that it uses PyArray_GenericReduceFunction, which is really slow compared to a straight C implementation).
Similarly (but even less logical), copy() and even the constructor are affected (yes, I understand that copy() creates contiguous arrays, but shouldn't it respect/retain the order nevertheless?):
I don't see why it is illogical: when you do a copy, you don't preserve memory layout, and so a simple memcpy of the whole buffer is not possible.
cheers, David

On Thursday, 08 November 2007 13:44:59, David Cournapeau wrote:
Hans Meine wrote:
I wonder why simple elementwise operations like "a * 2" or "a + 1" are not performed in order of increasing memory addresses in order to exploit CPU caches etc. - as it is now, their speed drops by a factor of around 3 simply by transpose()ing.
Because it is not trivial to do so in all cases, I guess. It is a problem which comes back from time to time on the ML, but AFAIK, nobody has had a fix for it. Fundamentally, for many element-wise operations, either you have to implement the thing for every possible case, or you abstract it through an iterator, which gives you a decrease in performance in some cases.
OK, so far I have not looked at the NumPy internals, but I would expect sth. like the following:

[elementwise operation on array 'a']
1) ii = sorted(range(a.ndim), key=lambda i: -a.strides[i])
2) aprime = a.transpose(ii)  # aprime will be "as C-contiguous as it gets"
3) bprime = perform_operation(aprime)  # fast elementwise operation
4) jj = [ii.index(i) for i in range(bprime.ndim)]  # invert ii
5) b = bprime.transpose(jj)  # let the result have the same order as the input

Apart from the fact that this is more or less pseudo-code and that the last step brings a behaviour change (which is intended, but probably not very welcome), am I overlooking a problem?
Similarly (but even less logical), copy() and even the constructor are affected (yes, I understand that copy() creates contiguous arrays, but shouldn't it respect/retain the order nevertheless?):
I don't see why it is illogical: when you do a copy, you don't preserve memory layout, and so a simple memcpy of the whole buffer is not possible.
As I wrote above, I don't think this is good. A fortran-order-contiguous array is still contiguous, and not inferior in any way to C-order arrays. So I actually expect copy() to return an array of unchanged order. The numpy book only says "copy() - Return a copy of the array (which is always single-segment, and ALIGNED).", which would also be fulfilled if the memory order was preserved by my proposed method above.

-- Ciao, / / /--/ / / ANS
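A runnable version of the recipe above (a sketch only; numpy.square stands in for the hypothetical perform_operation kernel):

import numpy

def elementwise_in_memory_order(a, perform_operation=numpy.square):
    ii = sorted(range(a.ndim), key=lambda i: -a.strides[i])
    aprime = a.transpose(ii)             # "as C-contiguous as it gets"
    bprime = perform_operation(aprime)   # fast elementwise operation
    jj = [ii.index(i) for i in range(bprime.ndim)]  # invert ii
    return bprime.transpose(jj)          # same index order as the input

a = numpy.asfortranarray(numpy.random.rand(4, 5, 6))
b = elementwise_in_memory_order(a)
assert numpy.allclose(b, a ** 2) and b.flags['F_CONTIGUOUS']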

On Nov 8, 2007 10:13 PM, Hans Meine <meine@informatik.uni-hamburg.de> wrote:
On Thursday, 08 November 2007 13:44:59, David Cournapeau wrote:
Hans Meine wrote:
I wonder why simple elementwise operations like "a * 2" or "a + 1" are not performed in order of increasing memory addresses in order to exploit CPU caches etc. - as it is now, their speed drops by a factor of around 3 simply by transpose()ing.
Because it is not trivial to do so in all cases, I guess. It is a problem which comes back from time to time on the ML, but AFAIK, nobody has had a fix for it. Fundamentally, for many element-wise operations, either you have to implement the thing for every possible case, or you abstract it through an iterator, which gives you a decrease in performance in some cases.
OK, so far I have not looked at the NumPy internals, but I would expect sth. like the following:
[elementwise operation on array 'a']
1) ii = sorted(range(a.ndim), key=lambda i: -a.strides[i])
2) aprime = a.transpose(ii)  # aprime will be "as C-contiguous as it gets"
3) bprime = perform_operation(aprime)  # fast elementwise operation
4) jj = [ii.index(i) for i in range(bprime.ndim)]  # invert ii
5) b = bprime.transpose(jj)  # let the result have the same order as the input
The problem is not F vs C storage: for an element-wise operation, it does not matter at all; you just apply the same function (perform_operation) over and over on every element of the array. The order does not matter at all.
But what if you have segmented buffers, non-aligned arrays, etc.? All this has to be taken care of, and this means handling reference counts and other things which are always delicate to handle well... Or you just use the current situation, which lets Python handle it (through PyObject_Call and a callable, at the C level). I agree that in some cases, it would be useful to have better performance: in particular, having fast code paths for common cases (aligned, and single-segment) would be nice. One example is the PyArray_Clip function (in multiarray.c), whose speed was too slow for my taste in my use cases: the general idea is to get to the point where you have a single C buffer, at which point you can call a trivial C function (fast_clip). As you can see, this is not trivial (~150 lines of C code): since the point is to go faster, you really want to avoid copying things, which means you have to be pretty careful. Speed-wise, it is definitely worth it, though: I don't remember the exact numbers, but in some cases (which are very common), it was several times faster.
Apart from the fact that this is more or less pseudo-code and that the last step brings a behaviour change (which is intended, but probably not very welcome), am I overlooking a problem?
Similarly (but even less logical), copy() and even the constructor are affected (yes, I understand that copy() creates contiguous arrays, but shouldn't it respect/retain the order nevertheless?):
I don't see why it is illogical: when you do a copy, you don't preserve memory layout, and so a simple memcpy of the whole buffer is not possible.
As I wrote above, I don't think this is good. A fortran-order-contiguous array is still contiguous, and not inferior in any way to C-order arrays. So I actually expect copy() to return an array of unchanged order.
Maybe this behaviour was kept for compatibility with numarray? If you look at the docstring, it says that copy may not return the same order as its input. It kind of makes sense to me: C order is the default of many numpy operations.
David

On Thursday, 08 November 2007 16:37:06, David Cournapeau wrote:
The problem is not F vs C storage: for element-wise operation, it does not matter at all; you just apply the same function (perform_operation) over and over on every element of the array. The order does not matter at all.
Yet Fortran order leads to several times slower operations, see the figures in my original post. :-(
But what if you have segmented buffers, non aligned, etc... arrays ?
The code I posted should deal with it - by sorting the indices by decreasing stride, I simply ensure that all (source and target) segments are traversed in order of increasing memory addresses. It does not affect segments or alignment.
All this has to be taken care of,
Right - my "perform_operation(aprime)" step would need to apply the operation on every memory segment.
and this means handling reference count and other things which are always delicate to handle well...
I am not sure that I understand where refcounting comes into play here.
Or you just use the current situation, which let python handle it (through PyObject_Call and a callable, at a C level).
I need to look at the code to see what you mean here. Probably, I have a wrong picture of where the slowness comes from (I thought that the order of the loops was wrong).
As I wrote above, I don't think this is good. A fortran-order-contiguous array is still contiguous, and not inferior in any way to C-order arrays. So I actually expect copy() to return an array of unchanged order.
Maybe this behaviour was kept for compatibility with numarray ? If you look at the docstring, it is said that copy may not return the same order than its input. It kind of make sense to me: C order is the default of many numpy operations.
That is very sad, because Fortran order is much more natural for handling images, where you're absolutely used to indexing with [x,y], with x being the faster-changing index in memory.

-- Ciao, / / /--/ / / ANS
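To make the layout point concrete (a small illustration, assuming numpy): with order="F", the first index is the one that moves fastest through memory.

import numpy

img = numpy.zeros((4, 3), order="F")   # indexed as img[x, y]
print(img.strides)                     # (8, 32): the x stride is one float64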

On Nov 9, 2007 12:50 AM, Hans Meine <meine@informatik.uni-hamburg.de> wrote:
On Thursday, 08 November 2007 16:37:06, David Cournapeau wrote:
The problem is not F vs C storage: for element-wise operation, it does not matter at all; you just apply the same function (perform_operation) over and over on every element of the array. The order does not matter at all.
Yet Fortran order leads to several times slower operations, see the figures in my original post. :-(
This is because the current implementation of at least some of the operations you are talking about uses PyArray_GenericReduce and other similar functions, which are really high level (they use Python callables, etc.). This is easier, because you don't have to care about anything (type, etc.), but this means the Python runtime is handling everything. Instead, you should use a pure C implementation (by pure C, I mean a C function totally independent of Python, only dealing with standard C types). This would already lead to a significant performance leap. Once you do that, you should not see a difference between F and C storage, normally.
But what if you have segmented buffers, non aligned, etc... arrays ?
The code I posted should deal with it - by sorting the indices by decreasing stride, I simply ensure that all (source and target) segments are traversed in order of increasing memory addresses. It does not affect segments or alignment.
If you have segmented addresses, I don't think the ordering matters much anymore, for memory access, no ? For example,
All this has to be taken care of,
Right - my "perform_operation(aprime)" step would need to apply the operation on every memory segment.
and this means handling reference count and other things which are always delicate to handle well...
I am not sure that I understand where refcounting comes into play here.
Or you just use the current situation, which let python handle it (through PyObject_Call and a callable, at a C level).
I need to look at the code to see what you mean here. Probably, I have a wrong picture of where the slowness comes from (I thought that the order of the loops was wrong).
As I wrote above, I don't think this is good. A fortran-order-contiguous array is still contiguous, and not inferior in any way to C-order arrays. So I actually expect copy() to return an array of unchanged order.
Maybe this behaviour was kept for compatibility with numarray ? If you look at the docstring, it is said that copy may not return the same order than its input. It kind of make sense to me: C order is the default of many numpy operations.
That is very sad, because Fortran order is much more natural for handling images, where you're absolutely used to index with [x,y], and x being the faster-changing index in memory.
-- Ciao, / / /--/ / / ANS


On Thursday, 08 November 2007 17:31:40, David Cournapeau wrote:
This is because the current implementation of at least some of the operations you are talking about uses PyArray_GenericReduce and other similar functions, which are really high level (they use Python callables, etc.). This is easier, because you don't have to care about anything (type, etc.), but this means the Python runtime is handling everything.
I suspected that after your last post, but that's really bad for pointwise operations on a contiguous, aligned array. A simple transpose() should really not make any difference here.
Instead, you should use a pure C implementation (by pure C, I mean a C function totally independent of Python, only dealing with standard C types). This would already lead to a significant performance leap.
AFAICS, it would be much more elegant and easier to implement this using C++ templates. We have a lot of experience with such a design from our VIGRA library ( http://kogs-www.informatik.uni-hamburg.de/~koethe/vigra/ ), which is an imaging library based on the STL concepts (and some necessary and convenient extensions for higher-dimensional arrays and a more flexible API). I am not very keen on writing hundreds of lines of C code for things that can easily be handled with C++ functors. But I don't think that I am the first to propose this, and I know that C has some advantages (faster compilation; are there more? ;-) ) - what is the opinion on this in the SciPy community?
If you have segmented addresses, I don't think the ordering matters much anymore, for memory access, no ?
Yes, I think it does. It probably depends on the sizes of the segments though. If you have a multi-segment box-sub-range of a large dataset (3D volume or even only 2D), processing each contiguous "row" (column/...) at once within the inner loop definitely makes a difference. I.e. as long as one dimension is not strided (and the data's extent in this dimension is not too small), it should be handled in the inner loop. The other loops probably don't make a big difference. -- Ciao, / / /--/ / / ANS
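A sketch of that inner-loop point (illustrative only): even for a strided box-sub-range, the unit-stride axis should drive the inner loop, so each row is processed as one contiguous run.

import numpy

vol = numpy.zeros((256, 256, 256))   # C order: the last axis is contiguous
box = vol[10:200, 20:220, 30:250]    # multi-segment view
for i in range(box.shape[0]):
    for j in range(box.shape[1]):
        box[i, j, :] += 1            # contiguous run along the last axis
assert vol[10, 20, 30] == 1 and vol[0, 0, 0] == 0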

On Nov 9, 2007 1:55 AM, Hans Meine <meine@informatik.uni-hamburg.de> wrote:
On Thursday, 08 November 2007 17:31:40, David Cournapeau wrote:
This is because the current implementation of at least some of the operations you are talking about uses PyArray_GenericReduce and other similar functions, which are really high level (they use Python callables, etc.). This is easier, because you don't have to care about anything (type, etc.), but this means the Python runtime is handling everything.
I suspected that after your last post, but that's really bad for pointwise operations on a contiguous, aligned array. A simple transpose() should really not make any difference here.
It should not, but in practice, it is not so easy to do. AFAIK, even Matlab has the same problem, with less difference, though. And they have much more resources than numpy. Not to say that we cannot do better, but this takes time.
Instead, you should use a pure C implementation (by pure C, I mean a C function totally independent of Python, only dealing with standard C types). This would already lead to a significant performance leap.
AFAICS, it would be much more elegant and easier to implement this using C++ templates. We have a lot of experience with such a design from our VIGRA library ( http://kogs-www.informatik.uni-hamburg.de/~koethe/vigra/ ), which is an imaging library based on the STL concepts (and some necessary and convenient extensions for higher-dimensional arrays and a more flexible API).
I am not very keen on writing hundreds of lines of C code for things that can easily be handled with C++ functors. But I don't think that I am the first to propose this, and I know that C has some advantages (faster compilation; are there more? ;-) )
The advantages of C over C++ are much more than compilation speed: it is actually portable, is simple, and is easily callable from other languages, three significant points that C++ does not meet at all. Generally, I don't see much benefit in using low-level C++ in numerical code; I consider most numerical C++ containers a total failure (the fact that after 20 years, there is still nothing close to a standard for even a matrix concept in C++ is for me quite striking). I think the aspect of C++ which would be more useful in numpy is RAII (I must confess that I personally don't like C++ very much, in particular when templates are involved). This is only my opinion; I don't know the opinion of other people on this.
Yes, I think it does. It probably depends on the sizes of the segments though. If you have a multi-segment box-sub-range of a large dataset (3D volume or even only 2D), processing each contiguous "row" (column/...) at once within the inner loop definitely makes a difference. I.e. as long as one dimension is not strided (and the data's extent in this dimension is not too small), it should be handled in the inner loop. The other loops probably don't make a big difference.
If you have contiguous segments in subranges, then this is already handled through the ufunc mechanism, but I don't know anything about their actual implementation. Other people much more knowledgeable than me can give you more info here.
David

This discussion makes me wonder if the basic element-wise operations could (should?) be special cased for contiguous arrays, reducing them to simple pointer incrementing from the start to the finish of the data block. The same code would work for C and Fortran order arrays, and be pretty simple. This would address Hans' issue, no? It's a special case but a common one.

-Chris

--
Christopher Barker, Ph.D.
Oceanographer
Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959 voice
7600 Sand Point Way NE   (206) 526-6329 fax
Seattle, WA 98115        (206) 526-6317 main reception
Chris.Barker@noaa.gov
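In Python terms, the special case Chris describes might look like the following sketch (illustrative only, not NumPy's actual code path; add_scalar is a hypothetical name):

import numpy

def add_scalar(a, scalar):
    # Hypothetical fast path: flat traversal for single-segment inputs.
    if a.flags['C_CONTIGUOUS']:
        return (a.reshape(-1) + scalar).reshape(a.shape)
    if a.flags['F_CONTIGUOUS']:
        # a.T is C-contiguous, so both reshapes are views; the result
        # comes back in Fortran order, matching the input.
        return (a.T.reshape(-1) + scalar).reshape(a.T.shape).T
    return a + scalar                # general strided machinery as fallback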

On Nov 9, 2007 3:28 AM, Christopher Barker <Chris.Barker@noaa.gov> wrote:
This discussion makes me wonder if the basic element-wise operations could (should?) be special cased for contiguous arrays, reducing them to simple pointer incrementing from the start to the finish of the data block.
I don't see why it could not. I think that most of the implementation could be common to most functions, because they share a lot (handling the out argument, the axis argument, and sometimes the dtype argument). The only different part would be the actual computation by a C function, which is trivial, as you pointed out.
cheers, David

On Thursday, 08 November 2007, Christopher Barker wrote:
This discussion makes me wonder if the basic element-wise operations could (should?) be special cased for contiguous arrays, reducing them to simple pointer incrementing from the start to the finish of the data block. The same code would work for C and Fortran order arrays, and be pretty simple.
This is exactly what I am thinking and talking about. I did not want to imply that NumPy is bad because it does not yet do so - it's an impressive piece of well-designed software AFAICS - but I do think that this is a really common and important use case that should not stay as inefficiently handled as it is now. Ciao, / / .o. /--/ ..o / / ANS ooo

Christopher Barker wrote:
This discussion makes me wonder if the basic element-wise operations could (should?) be special cased for contiguous arrays, reducing them to simple pointer incrementing from the start to the finish of the data block. The same code would work for C and Fortran order arrays, and be pretty simple.
This would address Hans' issue, no?
It's a special case but a common one.
There is a special case for this already. It's just that the specific operations he is addressing require the creation of output arrays that by default are in C order. This would need to change in order to take advantage of the special case. -Travis O.

Travis E. Oliphant wrote:
Christopher Barker wrote:
This discussion makes me wonder if the basic element-wise operations could (should?) be special cased for contiguous arrays, reducing them to simple pointer incrementing from the start to the finish of the data block. The same code would work for C and Fortran order arrays, and be pretty simple.
This would address Hans' issue, no?
It's a special case but a common one.
There is a special case for this already. It's just that the specific operations he is addressing require the creation of output arrays that by default are in C order. This would need to change in order to take advantage of the special case.
For copy and array creation, I understand this, but for element-wise operations (mean, min, and max), this is not enough to explain the difference, no? For example, I can understand a 50 % or 100 % time increase for simple operations (by simple, I mean one-element operations taking only a few CPU cycles), because of copies, but a 5-fold time increase seems too big, no (maybe a cache problem, though)? Also, the fact that mean is slower than min/max for both cases (F vs C) seems a bit counterintuitive (maybe cache effects are involved somehow?).
Again, I see huge differences between my Xeon PIV @ 3.2 GHz and my Pentium M @ 1.2 GHz for those operations: the Pentium M gives more "intuitive" results (and is almost as fast, and sometimes even faster than my Xeon, for arrays which can stay in cache).
cheers, David

David Cournapeau wrote:
Travis E. Oliphant wrote:
Christopher Barker wrote:
This discussion makes me wonder if the basic element-wise operations could (should?) be special cased for contiguous arrays, reducing them to simple pointer incrementing from the start to the finish of the data block. The same code would work for C and Fortran order arrays, and be pretty simple.
This would address Hans' issue, no?
It's a special case but a common one.
There is a special case for this already. It's just that the specific operations he is addressing require the creation of output arrays that by default are in C order. This would need to change in order to take advantage of the special case.
For copy and array creation, I understand this, but for element-wise operations (mean, min, and max), this is not enough to explain the difference, no? For example, I can understand a 50 % or 100 % time increase for simple operations (by simple, I mean one-element operations taking only a few CPU cycles), because of copies, but a 5-fold time increase seems too big, no (maybe a cache problem, though)? Also, the fact that mean is slower than min/max for both cases (F vs C) seems a bit counterintuitive (maybe cache effects are involved somehow?).
Again, I see huge differences between my Xeon PIV @ 3.2 GHz and my Pentium M @ 1.2 GHz for those operations: the Pentium M gives more "intuitive" results (and is almost as fast, and sometimes even faster than my Xeon, for arrays which can stay in cache).
I wasn't talking about the min, mean, and max methods specifically. These are all implemented with the reduce method of a ufunc. But, there are special cases for the reduce method as well and so relatively smooth pathways for optimization. -Travis O.
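Seen from Python, the routing Travis describes can be checked directly (a small illustration; maximum, minimum, and add are the underlying ufuncs):

import numpy

a = numpy.arange(12.).reshape(3, 4)
assert a.max() == numpy.maximum.reduce(a.ravel())
assert a.min() == numpy.minimum.reduce(a.ravel())
assert a.mean() == numpy.add.reduce(a.ravel()) / a.size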

Travis E. Oliphant wrote:
I wasn't talking about the min, mean, and max methods specifically. These are all implemented with the reduce method of a ufunc.
Ah, my mistake, I wrongly understood that only some of them were implemented through ufuncs. But the ufunc machinery has nothing to do with output array creation, right? So is the 5-fold time increase mainly happening inside the ufunc?
David

David Cournapeau wrote:
Travis E. Oliphant wrote:
I wasn't talking about the min, mean, and max methods specifically. These are all implemented with the reduce method of a ufunc.
Ah, my mistake, I wrongly understood that only some of them were implemented through ufuncs. But the ufunc machinery has nothing to do with output array creation, right? So is the 5-fold time increase mainly happening inside the ufunc?
I'm not sure. Ufuncs don't do everything in NumPy, as you have noted. There are less well publicized (and less structured) array functions for each data-type that also implement things. Not all of these places have "optimized" paths, but there was some thought given to it in several places of the code. One area that could be improved, and that may be exposed in some of the timing tests, is that iterator-based algorithms always use a "C-contiguous" iterator. There is no "Fortran-contiguous" iterator (although there could be). -Travis O.
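For illustration (assuming numpy): the flat iterator always walks in C index order, whatever the memory layout, which is why a Fortran-order array pays the strided price.

import numpy

a = numpy.array([[1, 2], [3, 4]], order="F")
print(list(a.flat))   # [1, 2, 3, 4]: C index order, strided in memory here
print(a.ravel())      # also C order; for an "F" array this makes a copy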

On 08/11/2007, David Cournapeau <david@ar.media.kyoto-u.ac.jp> wrote:
For copy and array creation, I understand this, but for element-wise operations (mean, min, and max), this is not enough to explain the difference, no? For example, I can understand a 50 % or 100 % time increase for simple operations (by simple, I mean one-element operations taking only a few CPU cycles), because of copies, but a 5-fold time increase seems too big, no (maybe a cache problem, though)? Also, the fact that mean is slower than min/max for both cases (F vs C) seems a bit counterintuitive (maybe cache effects are involved somehow?).
I have no doubt at all that cache effects are involved: for an int array, each data element is four bytes, but typical CPUs need to load 64 bytes at a time into cache. If you read (or write) the rest of those bytes in the next iterations through the loop, the (large) cost of a memory read is amortized. If you jump to the next row of the array, some large number of bytes away, those 64 bytes basically need to be purged to make room for another 64 bytes, of which you'll use 4.

If you're reading from a FORTRAN-order array and writing to a C-order one, there's no way around doing this on one end or another: you're effectively doing a transpose, which is pretty much always expensive.

Is there any reason not to let ufuncs pick whichever order for newly-allocated arrays they want? The natural choice would be the same as the bigger input array, if it's a contiguous block of memory (whether or not the contiguous flags are set). Failing that, the same as the other input array (if any); failing that, C order's as good a default as any. How difficult would this be to implement?

Anne
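A quick demonstration of the effect Anne describes (assumes numpy; absolute numbers are machine-dependent, but the strided walk is typically many times slower):

import timeit
import numpy

a = numpy.zeros((4096, 4096))        # float64; rows are 32 KiB apart
contiguous = lambda: a[0, :].sum()   # unit stride: one cache line serves 8 values
strided = lambda: a[:, 0].sum()      # 32 KiB stride: one cache line per value
print(timeit.timeit(contiguous, number=100))
print(timeit.timeit(strided, number=100))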

Anne Archibald wrote:
On 08/11/2007, David Cournapeau <david@ar.media.kyoto-u.ac.jp> wrote:
For copy and array creation, I understand this, but for element-wise operations (mean, min, and max), this is not enough to explain the difference, no? For example, I can understand a 50 % or 100 % time increase for simple operations (by simple, I mean one-element operations taking only a few CPU cycles), because of copies, but a 5-fold time increase seems too big, no (maybe a cache problem, though)? Also, the fact that mean is slower than min/max for both cases (F vs C) seems a bit counterintuitive (maybe cache effects are involved somehow?).
I have no doubt at all that cache effects are involved: for an int array, each data element is four bytes, but typical CPUs need to load 64 bytes at a time into cache. If you read (or write) the rest of those bytes in the next iterations through the loop, the (large) cost of a memory read is amortized. If you jump to the next row of the array, some large number of bytes away, those 64 bytes basically need to be purged to make room for another 64 bytes, of which you'll use 4. If you're reading from a FORTRAN-order array and writing to a C-order one, there's no way around doing this on one end or another: you're effectively doing a transpose, which is pretty much always expensive.
This is obviously part of the picture, and there is indeed no way around the simple fact that non-sequential memory access is extremely slow on modern hardware. The fact that the array iterator does not support Fortran order (yet) would largely explain this (I wrongly assumed the iterator already did that).
Is there any reason not to let ufuncs pick whichever order for newly-allocated arrays they want? The natural choice would be the same as the bigger input array, if it's a contiguous block of memory (whether or not the contiguous flags are set). Failing that, the same as the other input array (if any); failing that, C order's as good a default as any. How difficult would this be to implement?
I think part of the problem is that it is difficult to know a priori which is faster (is copying multi-segmented parts into a single buffer for processing faster than direct processing?). As mentioned previously, on the same platform (same OS/compiler combination), there are already huge differences between two CPUs (Pentium 4 vs Pentium M, the former being noticeably pig-slow when the SSE FPU is not used). Do ufuncs use buffers for segmented memory, or do they only use them for misbehaved arrays?

Also, from my very limited experience, I found array iterators to be significantly slower than simple array indexing on contiguous, single-segment arrays. Do ufuncs always use array iterators, or do they use simple array indexing when possible? When I implemented the fast clip function (which does not use ufunc), I noticed that the following matter:
- avoiding unnecessary copies.
- quickly determining which cases can be optimized wrt the operations you are using (memmove vs memcpy, etc...)
- fast memory copying (numpy does not have this yet: by using optimized memory copying, you can often gain a 50 % speed increase on recent archs; also, having information about alignment could help, too, and we go back to the aligned allocators discussion :) ).

The problem is to find a good API to put those optimizations in one place. I unfortunately do not know much about ufuncs; maybe they already implement most of those.

cheers, David

Hans Meine wrote:
Hi!
I wonder why simple elementwise operations like "a * 2" or "a + 1" are not performed in order of increasing memory addresses in order to exploit CPU caches etc.
C-order is "special" in NumPy due to the history. I agree that it doesn't need to be, and we have taken significant steps in that direction. Right now, the fundamental problem is probably due to the fact that the output arrays are being created as C-order arrays when the input is a Fortran-order array. Once that is changed, then we can cause Fortran-order inputs to use the simplest path through the code. Fortran-order arrays can be preserved, but it takes a little extra work because backward-compatible expectations had to be met. See, for example, the order argument to the copy method of arrays.
- as it is now, their speed drops by a factor of around 3 simply by transpose()ing. Similarly (but even less logical), copy() and even the constructor are affected (yes, I understand that copy() creates contiguous arrays, but shouldn't it respect/retain the order nevertheless?):
As mentioned, it can preserve order with the 'order' argument a.copy('A')
### constructor ###
In [89]: %timeit -r 10 -n 1000000 numpy.ndarray((3,3,3))
1000000 loops, best of 10: 1.19 µs per loop

In [90]: %timeit -r 10 -n 1000000 numpy.ndarray((3,3,3), order="f")
1000000 loops, best of 10: 2.19 µs per loop
I bet what you are seeing here is simply the overhead of processing the order argument. Try the first one with order='c' to see what I mean.
### copy 3x3x3 array ###
In [85]: a = numpy.ndarray((3,3,3))

In [86]: %timeit -r 10 a.copy()
1000000 loops, best of 10: 1.14 µs per loop

In [87]: a = numpy.ndarray((3,3,3), order="f")

In [88]: %timeit -r 10 -n 1000000 a.copy()
1000000 loops, best of 10: 3.39 µs per loop
Use the 'a' argument to allow copying in "fortran" order.
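A small illustration of the order-preserving copy (copy() accepts 'C', 'F', or 'A', where 'A' keeps the input's layout):

import numpy

a = numpy.zeros((3, 3, 3), order='F')
b = a.copy('A')   # 'A' preserves the input's (here: Fortran) layout
c = a.copy()      # the default is C order
print(b.flags['F_CONTIGUOUS'], c.flags['C_CONTIGUOUS'])   # True True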
### copy 256x256x256 array ###
In [74]: a = numpy.ndarray((256,256,256))

In [75]: %timeit -r 10 a.copy()
10 loops, best of 10: 119 ms per loop

In [76]: a = numpy.ndarray((256,256,256), order="f")

In [77]: %timeit -r 10 a.copy()
10 loops, best of 10: 274 ms per loop
Same thing here. Nobody is opposed to having faster code as long as we don't in the process break code bases. There is also the matter of implementation.... -Travis O.

On Friday, 09 November 2007 00:16:12, Travis E. Oliphant wrote:
C-order is "special" in NumPy due to the history. I agree that it doesn't need to be and we have taken significant steps in that direction.
Thanks for this clarifying statement.
Right now, the fundamental problem is probably due to the fact that the output arrays are being created as C-order arrays when the input is a Fortran order array. Once that is changed then we can cause Fortran-order inputs to use the simplest path through the code.
That looks like a plan. I have read about __array_wrap__ in your book, and it sounds exactly like what we need for our imaging library; this way we can ensure that the result of operations on images is again an image. Here, it is crucial that the resulting image has Fortran order, too (for C/C++ algorithms to agree with the Python idea of the images' orientation).
Fortran order arrays can be preserved but it takes a little extra work because backward compatible expectations had to be met. See for example the order argument to the copy method of arrays.
What do you mean exactly (if you have something specific in mind at all)? Should the order be configurable for all operations, and default to "C" for backwards compatibility? (BTW: copy() has no arguments yet in my numpybook, page 57.)

At last, I had a look at the relevant code for adding arrays; this is what I found: In core/src/umathmodule.c.src, the code templates for the relevant ufuncs (e.g. add) are defined, which will get expanded to e.g. DOUBLE_add. This will get wrapped into a ufunc in 'InitOperators' using

f = PyUFunc_FromFuncAndData(add_functions, add_data, add_signatures, 18, 2, 1,
                            PyUFunc_Zero, "add",
                            "adds the arguments elementwise.", 0);

and inserted as "add" into a dictionary which is later passed to PyArray_SetNumericOps. This will install the ufunc in the static NumericOps struct 'n_ops', which is used in the 'array_add' function installed in the PyTypeObject.

Thus, it is the ufunc code (not surprisingly) which seems to call DOUBLE_add for the inner loop:

| static void
| DOUBLE_add(char **args, intp *dimensions, intp *steps, void *func)
| {
|     register intp i;
|     intp is1=steps[0], is2=steps[1], os=steps[2], n=dimensions[0];
|     char *i1=args[0], *i2=args[1], *op=args[2];
|     for (i = 0; i < n; i++, i1+=is1, i2+=is2, op+=os) {
|         *((double *)op) = *((double *)i1) + *((double *)i2);
|     }
| }

If I understood David correctly, he suggested an unstrided variant of this inner loop, which simply uses pointer increments. While this is a good idea (also probably quite some work), the real thing bugging me is that the above DOUBLE_add could (and should!) be called by the ufunc framework in such a way that it is equally efficient for C and Fortran arrays. (I.e. with steps set to sizeof(double), without prior copying/conversion.) It could even be used with negative step sizes when the input and output arrays have different orders. Disclaimer: So far, I have not looked at the ufunc code, so I do not know which code paths exist.

BTW: What are the "void *func" arguments in core/src/umathmodule.c.src for? Shouldn't that be "void * /*func*/", since the parameters are never used?

-- Ciao, / / /--/ / / ANS
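A toy Python model of that point (hypothetical names; element steps instead of the byte steps used by the C code): the same strided kernel serves C and Fortran input alike if it is called once over the single memory segment with unit steps.

import numpy

def double_add(i1, i2, out, steps, n):
    # Python stand-in for DOUBLE_add: walk n elements with the given steps.
    is1, is2, os = steps
    for i in range(n):
        out[i * os] = i1[i * is1] + i2[i * is2]

a = numpy.asfortranarray(numpy.random.rand(3, 4))
b = numpy.asfortranarray(numpy.random.rand(3, 4))
out = numpy.empty(a.shape, order='F')
# Flat views in memory order; each array is one contiguous segment.
fa, fb, fo = a.T.reshape(-1), b.T.reshape(-1), out.T.reshape(-1)
double_add(fa, fb, fo, steps=(1, 1, 1), n=a.size)
assert numpy.allclose(out, a + b)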

Hi,

this comment might be counterproductive to this discussion. However: I'm also using numpy as a basis for putting together an image analysis "platform"/library. This includes both 2D and higher-dimensional images. Since all my code, if not in Python, is written in C or C++ and not Fortran, I decided early on that I had to get used to "inverse indexing", as in image[y,x] or image[z,y,x]. "Inverse" only refers to the common (alphabetically oriented) intuition to say "at point x,y,z" rather than "z,y,x".

I also argued that mathematical matrices (mostly / often) use an index order of "row, column", which essentially is "y, x" (if the matrix is seen as an image). In any case, getting used to a "z,y,x" order saved me lots of technical trouble, and after all, most code is or will be C/C++ and not Fortran.

Hope I was not entirely misguided,
Sebastian Haase

On Friday, 09 November 2007 13:04:24, Sebastian Haase wrote:
Since all my code, if not in Python, is written in C or C++ and not Fortran, I decided early on that I had to get used to "inverse indexing", as in image[y,x] or image[z,y,x]
We cannot do that here, since a) we use the opposite order in our C++ VIGRA, using operator(): img(x, y) is similar to img[y][x] in other libraries, and b) we have had Python bindings for quite some time (only not published), where we used img[x, y] already. Of course, we want to have the same order in both languages in order to facilitate porting algorithms after rapid prototyping.
"Inverse" only refers to the common (alphabetecally oriented) intuition to say "at point x,y,z" rather than "z,y,x".
Yes, this is the typical notation for coordinates and vectors. Nobody would write (z, y, x) vectors. ;-)
I also argued that mathematical matrices (mostly / often) use an index order of "row, column", which essentially is "y, x" (if the matrix is seen as an image)
That's right, it's different between matrices and images. In fact, we use "C" order for matrices in VIGRA: http://kogs-www.informatik.uni-hamburg.de/~koethe/vigra/doc/vigra/classvigra... -- Ciao, / / /--/ / / ANS

On Nov 9, 2007 1:43 PM, Hans Meine <meine@informatik.uni-hamburg.de> wrote:
On Friday, 09 November 2007 13:04:24, Sebastian Haase wrote: <snip> Of course, we want to have the same order in both languages in order to facilitate porting algorithms after rapid prototyping.
Yes, I understand your situation.
"Inverse" only refers to the common (alphabetecally oriented) intuition to say "at point x,y,z" rather than "z,y,x".
Yes, this is the typical notation for coordinates and vectors. Nobody would write (z, y, x) vectors. ;-)
This is of course my fate now - you have to choose your poisson. ;-) I find it worse having to think in Fortran order... That's where I was saying "I hope I wasn't misguided". Cheers, Sebastian

Hans Meine wrote:
Fortran order arrays can be preserved but it takes a little extra work because backward compatible expectations had to be met. See for example the order argument to the copy method of arrays.
What do you mean exactly (if you have something specific in mind at all)?
I meant that a.copy('A') will preserve the order of 'a' on copy.
Should the order be configurable for all operations, and default to "C" for backwards compatibility?
Yes, basically that should be the case when it matters.
(BTW: copy() has no arguments yet in my numpybook, page 57.)
At last, I had a look at the relevant code for adding arrays; this is what I found: In core/src/umathmodule.c.src, the code templates for the relevant ufuncs (e.g. add) are defined, which will get expanded to e.g. DOUBLE_add. This will get wrapped into a ufunc in 'InitOperators' using
f = PyUFunc_FromFuncAndData(add_functions, add_data, add_signatures, 18, 2, 1, PyUFunc_Zero, "add", "adds the arguments elementwise.", 0);
and inserted as "add" into a dictionary which is later passed to PyArray_SetNumericOps. This will install the ufunc in the static NumericOps struct 'n_ops', which is used in the 'array_add' function installed in the PyTypeObject.
Yep, you've got the code path correct.
Thus, it is the ufunc code (not surprisingly) which seems to call DOUBLE_add for the inner loop:
| static void
| DOUBLE_add(char **args, intp *dimensions, intp *steps, void *func)
| {
|     register intp i;
|     intp is1=steps[0], is2=steps[1], os=steps[2], n=dimensions[0];
|     char *i1=args[0], *i2=args[1], *op=args[2];
|     for (i = 0; i < n; i++, i1+=is1, i2+=is2, op+=os) {
|         *((double *)op) = *((double *)i1) + *((double *)i2);
|     }
| }
Indeed. The ufuncs store inner loops for all the data-types that are supported by the ufunc. These inner loops allow for "strided" memory access and are called by the general ufunc code. The loop structure holds the information needed to loop over the ufuncs.
If I understood David correctly, he suggested an unstrided variant of this inner loop, which simply uses pointer increments.
I'm not sure if he was saying that or not. But, more generally, I'm all for optimizations at this level. But they will have to be done differently for different cases, with this loop as the fall-back.
While this is a good idea (also probably quite some work), the real thing bugging me is that the above DOUBLE_add could (and should!) be called by the ufunc framework in such a way that it is equally efficient for C and Fortran arrays.
Yes, that's what I was talking about. There is actually a path through the ufunc code where this loop is called only once. The requirement right now is that all the arrays are C-contiguous, but this should be changed so that all arrays have the same contiguousness (and the output-array creation code changed to create Fortran-order arrays when the inputs are all Fortran-order).
BTW: What are the "void *func" arguments in core/src/umathmodule.c.src for? Shouldn't that be "void * /*func*/", since the parameters are never used?
Sure, it could be void * /* func */.
You'll have to forgive my ignorance, but does that matter? Are we using up a register or something by naming an unused argument?
-Travis O.

On Friday, 09 November 2007, Travis E. Oliphant wrote:
While this is a good idea (also probably quite some work), the real thing bugging me is that the above DOUBLE_add could (and should!) be called by the ufunc framework in such a way that it is equally efficient for C and Fortran arrays.
Yes, that's what I was talking about. There is actually a path through the ufunc code where this loop is called only once. The requirement right now is that all the arrays are C-contiguous, but this should be changed so that all arrays have the same contiguousness (and the output-array creation code changed to create Fortran-order arrays when the inputs are all Fortran-order).
After some measurements, I must say that even the slower Fortran variant is competitive (read: faster ;-) ), compared with our very flexible dynamic functors used in the current interactive VIGRA. IOW: Good job. :-)
BTW: What are the "void *func" in core/src/umathmodule.c.src arguments for? Shouldn't that be "void * /*func*/", since the parameters are never used?
Sure, it could be void * /* func*/
You'll have to forgive my ignorance, but does that matter? Are we using up a register or something by naming an unused argument?
No, but it triggers a compiler warning ("unused parameter foo", if warnings are enabled), which I even think to be useful. At least in g++, but that obviously showed my ignorance of C, since gcc -W does not complain. ;-) BTW: I am getting an interesting warning for numpy/core/src/scalartypes.inc.src:288 (empty 'if' body), which indeed looks suspicious. The rest of the warnings can be neglected, AFAICS.

Ciao, / / .o. /--/ ..o / / ANS ooo

On Nov 10, 2007 4:39 AM, Hans Meine <meine@informatik.uni-hamburg.de> wrote:
On Friday, 09 November 2007, Travis E. Oliphant wrote:
While this is a good idea (also probably quite some work), the real thing bugging me is that the above DOUBLE_add could (and should!) be called by the ufunc framework in such a way that it is equally efficient for C and Fortran arrays.
Yes, that's what I was talking about. There is actually a path through the ufunc code where this loop is called only once. The requirement right now is that all the arrays are C-contiguous, but this should be changed so that all arrays have the same contiguousness (and the output-array creation code changed to create Fortran-order arrays when the inputs are all Fortran-order).
After some measurements, I must say that even the slower Fortran variant is competitive (read: faster ;-) ), compared with our very flexible dynamic functors used in the current interactive VIGRA. IOW: Good job. :-)
I am not surprised at all: although I've seen many times people saying that C++ can provide good abstractions without cost, this is not my experience, at least with g++, whatever the method/library used (blitz, the ones in boost, etc...). I am really convinced that C++ provides bad abstractions for numerical work (and the fundamental design decision of C++ to avoid extending the language and to implement things in libraries instead is flawed, at least for numerical work). cheers, David

On Saturday, 10 November 2007, David Cournapeau wrote:
After some measurements, I must say that even the slower Fortran variant is competitive (read: faster ;-) ), compared with our very flexible dynamic functors used in the current interactive VIGRA. IOW: Good job. :-)
I am not surprised at all: although I've seen many times people saying that C++ can provide good abstractions without cost, this is not my experience, at least with g++.
Oh, I tried to prevent this misunderstanding: this is about the *interactive* VIGRA. There, we have a functor grammar parser which composes functors (objects with virtual methods etc.) according to the syntax tree, in order to compute complex operations in one go. In case you're interested in our API, here's the relevant part of a mini tutorial I wrote recently: http://kogs-www.informatik.uni-hamburg.de/~meine/vigra_interactive/#processi...
This is not to be confused with STL-like C++ functors; in the normal C++ VIGRA we use those a lot, and our experience is very positive. E.g. our copyImage, which takes two 2D source iterators (upper left, lower right), a source accessor (which may perform additional conversions or pull pixel components together from different sources), and a destination iterator + accessor, runs at exactly the speed of a memcpy when standard (pass-through) accessors are used. (In fact, it was slightly faster, but most probably within measurement uncertainty.)
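(To put the distinction into C terms, a minimal sketch under the assumption that a "dynamic functor" ultimately means one indirect call per element; the names are invented, this is not VIGRA code:)

/* Dynamic variant: the operation arrives as a function pointer, so the
   compiler can neither inline it nor vectorize across the call. */
void map_dynamic(double *dst, const double *src, long n,
                 double (*op)(double))
{
    long i;
    for (i = 0; i < n; i++)
        dst[i] = op(src[i]);
}

/* Static variant: the operation is visible at compile time, so the loop
   can be unrolled and vectorized like any hand-written loop. */
void map_add_one(double *dst, const double *src, long n)
{
    long i;
    for (i = 0; i < n; i++)
        dst[i] = src[i] + 1.0;
}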
Whatever the method/library used (blitz, the ones in boost, etc...).
I like boost a lot, but I think in some places it may stress compilers too much. OTOH, my experience is that while compilation takes ages, the resulting code is quite fast. But that may depend on which (part of the) library you're talking about.
I am really convinced that C++ provides the wrong abstractions for numerical work (and the fundamental design decision of C++ to avoid extending the language and to implement things in libraries instead is flawed, at least for numerical work).
I cannot follow you here, but I think there's not much point in having yet another "which language is best" discussion, especially when it becomes off-topic. I just wanted to tell you that my statement above is not evidence that "C++ can[not] provide abstractions without cost". The same operation written directly in C++ would be much faster than NumPy, but that's comparing apples and oranges (C++ vs. Python). I just compared our Python extension module with NumPy, and the two use very different approaches (cf. the above link). Ciao, / / .o. /--/ ..o / / ANS ooo

On Nov 10, 2007 1:12 AM, Travis E. Oliphant <oliphant@enthought.com> wrote:
Hans Meine wrote:
static void
DOUBLE_add(char **args, intp *dimensions, intp *steps, void *func)
{
    register intp i;
    intp is1=steps[0],is2=steps[1],os=steps[2], n=dimensions[0];
    char *i1=args[0], *i2=args[1], *op=args[2];
    for(i=0; i<n; i++, i1+=is1, i2+=is2, op+=os) {
        *((double *)op)=*((double *)i1) + *((double *)i2);
    }
}
If I understood David correctly, he suggested an unstrided variant of this inner loop, which simply uses pointer increments.
I'm not sure if he was saying that or not. But, more generally, I'm all for optimizations at this level. They will just have to be done differently for the different cases, with this loop as the fall-back.
I was not really clear, but yes, this was part of my argument: I don't think compilers can optimize the above well when there is no stride (more exactly, when the striding could be replaced by simple array indexing).
This would need some benchmarks, but I have always read that pointer arithmetic should be avoided when speed matters (e.g. *(double *)((char *)a + n * sizeof(double)) compared to a[n]), because it becomes much more difficult for the compiler to optimize. Generally, if you can get to a function which does the thing the "obvious way", this is better. Of course, you have to separate the cases where this is possible from the cases where it is not. But such work would also be really helpful if/when we optimize some basic things with MMX/SSE and co, and I think the above is impossible to auto-vectorize (gcc 4.3, not released yet, gives some really helpful analysis for that: it can tell you when it fails to auto-vectorize, and why).
While this is a good idea (also probably quite some work), the real thing bugging me is that the above DOUBLE_add could (and should!) be called by the ufunc framework in such a way that it is equally efficient for C and Fortran arrays.
Yes, that's what I was talking about. There is actually a path through the ufunc code where this loop is called only once. The requirement right now is that all the arrays are C-contiguous, but this should be changed so that all arrays merely need the same contiguousness (and the output-array creation code changed to create Fortran-order arrays when the inputs are all Fortran-order).
This was my other point. For some of my own C code, that's what I do: I have a function to detect whether the array can be treated sequentially, for cases where C vs. Fortran order does not matter at all. I don't see any difficulty in providing such a facility in numpy, at least for the common operations we are talking about, but maybe I am missing something. I should take a look at the ufunc machinery.
David
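(A minimal sketch of such a detection, with invented helper names rather than the actual ufunc code, assuming contiguity flags are available per operand: if all operands are contiguous in the same memory order, the n-dimensional operation collapses into a single flat loop, and C vs. Fortran order ceases to matter.)

typedef long intp;

/* Nonzero if all nop operands are contiguous in the *same* order
   (all C-contiguous or all Fortran-contiguous); then one unstrided
   pass over the underlying buffers suffices. */
int can_treat_flat(int nop, const int *c_contig, const int *f_contig)
{
    int all_c = 1, all_f = 1, k;
    for (k = 0; k < nop; k++) {
        all_c = all_c && c_contig[k];
        all_f = all_f && f_contig[k];
    }
    return all_c || all_f;
}

/* The flat special case of DOUBLE_add: plain indexing, no byte strides,
   which is exactly the shape of loop compilers optimize best. */
void double_add_flat(const double *a, const double *b, double *out, intp n)
{
    intp i;
    for (i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}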

On Nov 10, 2007 11:23 AM, David Cournapeau <cournape@gmail.com> wrote:
[...] But such work would also be really helpful if/when we optimize some basic things with MMX/SSE and co, and I think the above is impossible to auto-vectorize (gcc 4.3, not released yet, gives some really helpful analysis for that: it can tell you when it fails to auto-vectorize, and why).
Actually, gcc 4.2 already supports the option: -ftree-vectorizer-verbose=n. cheers, David
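(For reference, a made-up usage example: with a flat loop like double_add_flat above in a file flat_add.c, the report can be requested via

  gcc -O2 -ftree-vectorize -ftree-vectorizer-verbose=2 -c flat_add.c

and the output states per loop whether it was vectorized and, if not, which dependence, stride, or alignment problem prevented it; gcc 4.3 just makes the analysis more detailed.)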

On Nov 9, 2007 7:23 PM, David Cournapeau <cournape@gmail.com> wrote:
On Nov 10, 2007 1:12 AM, Travis E. Oliphant <oliphant@enthought.com> wrote:
Hans Meine wrote:
static void
DOUBLE_add(char **args, intp *dimensions, intp *steps, void *func)
{
    register intp i;
    intp is1=steps[0],is2=steps[1],os=steps[2], n=dimensions[0];
    char *i1=args[0], *i2=args[1], *op=args[2];
    for(i=0; i<n; i++, i1+=is1, i2+=is2, op+=os) {
        *((double *)op)=*((double *)i1) + *((double *)i2);
    }
}
If I understood David correctly, he suggested an unstrided variant of this inner loop, which simply uses pointer increments.
I'm not sure if he was saying that or not. But, more generally, I'm all for optimizations at this level. They will just have to be done differently for the different cases, with this loop as the fall-back.
I was not really clear, but yes, this was part of my argument: I don't think compilers can optimize the above well when there is no stride (more exactly, when the striding could be replaced by simple array indexing).
This would need some benchmarks, but I have always read that pointer arithmetic should be avoided when speed matters (e.g. *(double *)((char *)a + n * sizeof(double)) compared to a[n]), because it becomes much more difficult for the compiler to optimize. Generally, if you can get to a function which does the thing the "obvious way", this is better. Of course, you have to separate the cases where this is possible from the cases where it is not. But such work would also be really helpful if/when we optimize some basic things with MMX/SSE and co, and I think the above is impossible to auto-vectorize (gcc 4.3, not released yet, gives some really helpful analysis for that: it can tell you when it fails to auto-vectorize, and why).
The relative speed of pointers vs indexing depends on a lot of things:
1) the architecture (registers, instruction sets, implementation)
2) the compiler
3) the code (number of base addresses, etc.)
My subjective feeling is that on newer Intel-type hardware, using a recent compiler, indexing is generally faster. But it does depend. I wrote an indexed version of the code above; here it is together with the pointer version for comparison:

typedef int intp;

void DOUBLE_add_ptr(char **args, intp *dimensions, intp *steps, void *func)
{
    register intp i;
    intp is1=steps[0],is2=steps[1],os=steps[2], n=dimensions[0];
    char *i1=args[0], *i2=args[1], *op=args[2];
    for(i=0; i<n; i++, i1+=is1, i2+=is2, op+=os) {
        *((double *)op)=*((double *)i1) + *((double *)i2);
    }
}

void DOUBLE_add_ind(char **args, intp *dimensions, intp *steps, void *func)
{
    const intp i1s=steps[0]/sizeof(double);
    const intp i2s=steps[1]/sizeof(double);
    const intp ios=steps[2]/sizeof(double);
    const intp n=dimensions[0];
    double * const p1 = (double*)args[0];
    double * const p2 = (double*)args[1];
    double * const po = (double*)args[2];
    intp i, io, i1, i2;
    for(i=0, io=0, i1=0, i2=0; i<n; ++i) {
        po[io] = p1[i1] + p2[i2];
        io += ios;
        i1 += i1s;
        i2 += i2s;
    }
}

Compiled with gcc -O2 -fno-strict-aliasing -march=prescott -m32 -S, the inner loop of the pointer version looks like

.L4:
    fldl (%edx)
    faddl (%ecx)
    fstpl (%eax)
    addl $1, %ebx
    addl -20(%ebp), %edx
    addl -16(%ebp), %ecx
    addl %edi, %eax
    cmpl %esi, %ebx
    jne .L4

while the indexed version looks like

.L11:
    movl -24(%ebp), %esi
    fldl (%esi,%ecx,8)
    movl -20(%ebp), %esi
    faddl (%esi,%edx,8)
    movl -16(%ebp), %esi
    fstpl (%esi,%eax,8)
    addl -36(%ebp), %eax
    addl -32(%ebp), %ecx
    addl %edi, %edx
    addl $1, %ebx
    cmpl -28(%ebp), %ebx
    jne .L11

Note that the indexed version is rather short of registers and has to load all of the pointers and two of the indexes from the stack.

Chuck