[Numpy-discussion] strided copy unroll, benchmarks needed

Thu Jun 13 13:25:39 EDT 2013

Hi,
I posted a pull request with a minor change that instructs GCC to
unroll the strided copy loops (GCC almost never does that on its own,
not even at -O3).

https://github.com/numpy/numpy/pull/3429
It improves the performance of these copies by 20%-50%, depending on the
size of the data copied (once the data no longer fits in any CPU cache
you gain nothing), on a couple of machines (AMD Phenom X4, Intel Core 2
Duo, Xeon 7xxx/5xxx).
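
For readers who have not seen this kind of change, here is a minimal sketch
of what asking GCC to unroll a single strided copy loop can look like. It is
not the code from the pull request: the macro name SKETCH_UNROLL_LOOPS and the
function strided_copy_f64 are made up for illustration, and the sketch assumes
GCC's optimize("unroll-loops") function attribute. On other compilers the
macro expands to nothing, so the loop is simply compiled as usual.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical macro for this sketch: request loop unrolling for one
 * function on GCC only; other compilers see an empty macro. */
#if defined(__GNUC__) && !defined(__clang__)
#define SKETCH_UNROLL_LOOPS __attribute__((optimize("unroll-loops")))
#else
#define SKETCH_UNROLL_LOOPS
#endif

/* Copy n doubles from src to dst. Strides are given in elements, not
 * bytes (an assumption of this sketch, chosen to keep it short). */
SKETCH_UNROLL_LOOPS
static void strided_copy_f64(double *dst, ptrdiff_t dst_stride,
                             const double *src, ptrdiff_t src_stride,
                             size_t n)
{
    assert(n == 0 || (dst != NULL && src != NULL));
    for (size_t i = 0; i < n; i++) {
        /* One load and one store per iteration; this is the loop body
         * the compiler is asked to replicate when unrolling. */
        dst[(ptrdiff_t)i * dst_stride] = src[(ptrdiff_t)i * src_stride];
    }
}
```

Calling strided_copy_f64(dst, 1, src, 2, 4) copies every second element of
src into a contiguous dst. Putting the attribute on the function keeps the
override local to the copy loops instead of changing flags for the whole
build.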

As overriding the compiler's decision is always dodgy, I would like some
numbers on a couple of CPU types to decide if it's really a good idea.
So if you have the time, please try the pull request and the benchmark
in its first comment, and report in the PR the difference in performance
between the pull request and the unchanged numpy git head.
Please include your CPU, GCC version and architecture (32 bit or 64 bit).
The benchmark can be run with ipython:
irunner --ipython bench.py

Cheers,
Julian