[Numpy-discussion] Unnecessarily bad performance of elementwise operators with Fortran-arrays

David Cournapeau cournape at gmail.com
Fri Nov 9 21:23:02 EST 2007


On Nov 10, 2007 1:12 AM, Travis E. Oliphant <oliphant at enthought.com> wrote:
> Hans Meine wrote:
> > | static void
> > | DOUBLE_add(char **args, intp *dimensions, intp *steps, void *func)
> > | {
> > |     register intp i;
> > |     intp is1=steps[0],is2=steps[1],os=steps[2], n=dimensions[0];
> > |     char *i1=args[0], *i2=args[1], *op=args[2];
> > |     for(i=0; i<n; i++, i1+=is1, i2+=is2, op+=os) {
> > |         *((double *)op)=*((double *)i1) + *((double *)i2);
> > |     }
> > | }
> >
> > If I understood David correctly, he suggested an unstrided variant of
> > this inner loop, which simply uses pointer increments.
> I'm not sure if he was saying that or not.  But, more generally, I'm all
> for optimizations at this level.  But, they will have to be done
> differently for different cases with this loop as the fall-back.

I was not really clear, but yes, that was part of my argument: I don't
think compilers can optimize the above loop well in the case where
there is effectively no stride (more exactly, where the striding could
be replaced by plain array indexing).
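
To make this concrete, the contiguous special case could look roughly
like the following (just a sketch, not actual numpy code; the name
DOUBLE_add_contig is made up, and the caller would have to check the
strides before dispatching to it):

    static void
    DOUBLE_add_contig(char **args, intp *dimensions, intp *steps, void *func)
    {
        /* assumes steps[0] == steps[1] == steps[2] == sizeof(double) */
        const double *i1 = (const double *)args[0];
        const double *i2 = (const double *)args[1];
        double *op = (double *)args[2];
        intp i, n = dimensions[0];
        for (i = 0; i < n; i++) {
            /* plain indexing on typed pointers is much easier for the
               compiler to optimize (and vectorize) than byte-strided
               char pointers */
            op[i] = i1[i] + i2[i];
        }
    }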

This would need some benchmarks, but I have always read that manual
pointer arithmetic should be avoided when speed matters (e.g. stepping
a char pointer by n * sizeof(*a) bytes instead of simply writing a[n]),
because it is much harder for the compiler to optimize. Generally, if
you can reach a function which does the thing the "obvious" way, that
is better. Of course, you have to separate the cases where this is
possible from those where it is not. But such work would also be really
helpful if/when we optimize some basic operations with MMX/SSE and co,
and I think the loop above is impossible to auto-vectorize (gcc 4.3,
not released yet, gives some really helpful analysis for that: it can
tell you when it fails to auto-vectorize a loop, and why).
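
(For those who want to check this: with a gcc recent enough to have
the tree vectorizer, something along the lines of

    gcc -O3 -ftree-vectorize -ftree-vectorizer-verbose=2 -c file.c

should report, for each loop, whether it was vectorized and, if not,
what defeated the vectorizer. The exact flag spelling may vary between
gcc versions, and file.c is only a placeholder here.)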

> >  While this is
> > a good idea (also probably quite some work), the real thing bugging me is
> > that the above DOUBLE_add could (and should!) be called by the ufunc
> > framework in such a way that it is equally efficient for C and Fortran
> > arrays.
> Yes, that's what I was talking about.  There is actually a path through
> the ufunc code where this loop is called only once.  The requirement
> right now is that all the arrays are C-contiguous, but this should be
> changed to all arrays have the same contiguousness (and the output-array
> creation code changed to create Fortran-order arrays when the inputs are
> all Fortran-order).

This was my other point. That's what I do in some of my own C code: I
have a function which detects whether an array can be treated as one
sequential block, for the cases where C vs Fortran order does not
matter at all. I don't see any difficulty in providing such a facility
in numpy, at least for the common operations we are talking about, but
maybe I am missing something. I should take a look at the ufunc
machinery.
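
Roughly what I have in mind, as a sketch against the numpy C API (the
helper name is made up, and this is certainly not what numpy itself
would end up using):

    #include <numpy/arrayobject.h>

    /* Return 1 if every array is contiguous in the same memory order,
       so an elementwise loop can walk all of them as flat, unit-stride
       buffers and the C vs Fortran question disappears. */
    static int
    all_same_contiguous(PyArrayObject **arrs, int n)
    {
        int i, all_c = 1, all_f = 1;
        for (i = 0; i < n; i++) {
            if (!PyArray_ISCONTIGUOUS(arrs[i]))
                all_c = 0;
            if (!PyArray_ISFORTRAN(arrs[i]))
                all_f = 0;
        }
        return all_c || all_f;
    }

If such a check passes, the inner loop could be called once over the
whole data with unit strides, whatever the memory order.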

David


