On 12.03.2015 at 13:48, Julian Taylor <jtaylor.debian@googlemail.com> wrote:
On 03/12/2015 10:15 AM, Gregor Thalhammer wrote:
Another note: numpy makes it easy to provide new ufuncs from a C function that operates on 1D arrays, see http://docs.scipy.org/doc/numpy-dev/user/c-info.ufunc-tutorial.html — but this function needs to support arbitrary spacing (strides) between the items. Unfortunately, to achieve good performance, vector math libraries often expect that the items are laid out contiguously in memory; MKL/VML is a notable exception. So for non-contiguous input or output arrays you might need to copy the data to a buffer, which likely kills a large part of the performance gain.
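To make the contiguity issue concrete, here is a small illustration in plain NumPy of how a sliced view ends up with non-unit strides and why a contiguous-only library would force a copy (the array values are arbitrary):

```python
import numpy as np

# A sliced view has non-unit strides and is not C-contiguous.
a = np.arange(10, dtype=np.float64)[::2]
print(a.strides)                 # (16,): 16 bytes between items, not 8
print(a.flags['C_CONTIGUOUS'])   # False

# np.ascontiguousarray copies strided data into a fresh contiguous
# buffer -- exactly what a contiguous-only vector math library needs.
b = np.ascontiguousarray(a)
print(b.strides)                 # (8,)
print(b.flags['C_CONTIGUOUS'])   # True
```

A C inner loop registered as a ufunc receives these byte strides as arguments and must advance its pointers by them, rather than assuming unit stride.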
The elementary functions are very slow even compared to memory access; they take on the order of hundreds to tens of thousands of cycles to complete (depending on range and required accuracy). Even in the case of strided access, that gives the hardware prefetchers plenty of time to load the data before the previous computation is done.
That might apply to the mathematical functions from the standard libraries, but it is not true for the optimized libraries. Typical numbers are 4-10 CPU cycles per operation, see e.g. https://software.intel.com/sites/products/documentation/doclib/mkl_sa/112/vm... The benchmarks at https://github.com/geggo/uvml show that access to main memory limits the performance of calculating exp for large array sizes. This test was done quite some time ago; memory bandwidth is typically higher now, but so is computational power.
This also removes the requirement for the library to provide a strided API: we can copy the strided data into a contiguous buffer and pass that to the library without losing much performance. It may not be optimal (e.g. a library can fine-tune the prefetching better for the case where the hardware is not ideal), but it is most likely sufficient.
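The buffering scheme described above can be sketched in a few lines of Python. `blocked_apply` and its `blocksize` parameter are hypothetical names for illustration; the idea is that the scratch buffer is reused and sized to stay in cache:

```python
import numpy as np

def blocked_apply(func, src, out, blocksize=4096):
    """Apply `func` (which assumes a contiguous 1D input) to a possibly
    strided array by gathering fixed-size blocks into a contiguous
    scratch buffer first.  `blocksize` is a tuning knob: the buffer
    should be small enough to stay within the L1/L2 cache."""
    buf = np.empty(blocksize, dtype=src.dtype)  # reused contiguous scratch
    n = src.shape[0]
    for start in range(0, n, blocksize):
        stop = min(start + blocksize, n)
        m = stop - start
        buf[:m] = src[start:stop]         # gather strided data -> contiguous
        out[start:stop] = func(buf[:m])   # library call on a contiguous block
    return out

# Usage: evaluate exp over a strided view.
x = np.linspace(0.0, 1.0, 100001)[::3]
y = np.empty_like(x)
blocked_apply(np.exp, x, y)
```

A real implementation would do the block copy in C inside the ufunc loop, but the structure is the same.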
Copying the data to a buffer small enough to fit into cache might add only a few cycles per item, but given that the optimized functions themselves take only a handful of cycles, this already impacts performance significantly. Curious to see how much. Gregor
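One rough way to answer "how much" is a micro-benchmark comparing exp on a contiguous array against copy-then-exp on a strided view of the same length. Timings are machine-dependent, so no numbers are claimed here; this is just a measurement sketch:

```python
import timeit
import numpy as np

n = 1_000_000
contig = np.random.rand(n)
strided = np.random.rand(2 * n)[::2]   # same length, non-unit stride

# Direct evaluation on contiguous data.
t_contig = timeit.timeit(lambda: np.exp(contig), number=20)
# Copy to a contiguous buffer first, then evaluate -- the pattern a
# contiguous-only vector math library would force on strided input.
t_copy = timeit.timeit(lambda: np.exp(np.ascontiguousarray(strided)),
                       number=20)

print(f"contiguous:      {t_contig:.4f} s")
print(f"copy + evaluate: {t_copy:.4f} s")
```

With plain NumPy the function evaluation dominates, so the gap is modest; against a 4-10 cycles/element optimized library the relative copy cost would be much larger.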
Figuring out how best to do this to get the best performance, while staying flexible in which implementation is used, is part of the challenge the student will face in this project.

_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion