On 12.03.2015 at 13:48, Julian Taylor <jtaylor.debian@googlemail.com> wrote:
On 03/12/2015 10:15 AM, Gregor Thalhammer wrote:
Another note: numpy makes it easy to provide new ufuncs from a C function that operates on 1D arrays, see http://docs.scipy.org/doc/numpy-dev/user/c-info.ufunc-tutorial.html — but this function needs to support arbitrary spacing (strides) between the items. Unfortunately, to achieve good performance, vector math libraries often expect that the items are laid out contiguously in memory; MKL/VML is a notable exception. So for non-contiguous input or output arrays you might need to copy the data to a buffer, which likely kills a large part of the performance gain.
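To make the contiguity issue concrete, here is a small illustration in plain NumPy of how a sliced view ends up with non-unit strides and why a contiguous-only library would force a copy (the array values are arbitrary):

```python
import numpy as np

# A sliced view has non-unit strides and is not C-contiguous.
a = np.arange(10, dtype=np.float64)[::2]
print(a.strides)                 # (16,): 16 bytes between items, not 8
print(a.flags['C_CONTIGUOUS'])   # False

# np.ascontiguousarray copies strided data into a fresh contiguous
# buffer -- exactly what a contiguous-only vector math library needs.
b = np.ascontiguousarray(a)
print(b.strides)                 # (8,)
print(b.flags['C_CONTIGUOUS'])   # True
```

A C inner loop registered as a ufunc receives these byte strides as arguments and must advance its pointers by them, rather than assuming unit stride.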
The elementary functions are very slow even compared to memory access; they take on the order of hundreds to tens of thousands of cycles to complete (depending on range and required accuracy). Even in the case of strided access, that gives the hardware prefetchers plenty of time to load the data before the previous computation is done.
That might apply to the mathematical functions from the standard libraries, but it is not true for the optimized libraries. Typical numbers are 4-10 CPU cycles per operation, see e.g. https://software.intel.com/sites/products/documentation/doclib/mkl_sa/112/vm... The benchmarks at https://github.com/geggo/uvml show that access to main memory limits the performance of calculating exp for large array sizes. This test was done quite some time ago; memory bandwidth is typically higher now, but so is computational power.
This also removes the requirement for the library to provide a strided API: we can copy the strided data into a contiguous buffer and pass that to the library without losing much performance. It may not be optimal (e.g. a library can fine-tune the prefetching better for the case where the hardware is not ideal), but it is most likely sufficient.
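The buffering scheme described above can be sketched in a few lines of Python. `blocked_apply` and its `blocksize` parameter are hypothetical names for illustration; the idea is that the scratch buffer is reused and sized to stay in cache:

```python
import numpy as np

def blocked_apply(func, src, out, blocksize=4096):
    """Apply `func` (which assumes a contiguous 1D input) to a possibly
    strided array by gathering fixed-size blocks into a contiguous
    scratch buffer first.  `blocksize` is a tuning knob: the buffer
    should be small enough to stay within the L1/L2 cache."""
    buf = np.empty(blocksize, dtype=src.dtype)  # reused contiguous scratch
    n = src.shape[0]
    for start in range(0, n, blocksize):
        stop = min(start + blocksize, n)
        m = stop - start
        buf[:m] = src[start:stop]         # gather strided data -> contiguous
        out[start:stop] = func(buf[:m])   # library call on a contiguous block
    return out

# Usage: evaluate exp over a strided view.
x = np.linspace(0.0, 1.0, 100001)[::3]
y = np.empty_like(x)
blocked_apply(np.exp, x, y)
```

A real implementation would do the block copy in C inside the ufunc loop, but the structure is the same.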
Copying the data to a buffer small enough to fit into cache might add only a few cycles per item, but given that the optimized functions themselves take only a handful of cycles, this already impacts performance significantly. Curious to see how much. Gregor
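One rough way to answer "how much" is a micro-benchmark comparing exp on a contiguous array against copy-then-exp on a strided view of the same length. Timings are machine-dependent, so no numbers are claimed here; this is just a measurement sketch:

```python
import timeit
import numpy as np

n = 1_000_000
contig = np.random.rand(n)
strided = np.random.rand(2 * n)[::2]   # same length, non-unit stride

# Direct evaluation on contiguous data.
t_contig = timeit.timeit(lambda: np.exp(contig), number=20)
# Copy to a contiguous buffer first, then evaluate -- the pattern a
# contiguous-only vector math library would force on strided input.
t_copy = timeit.timeit(lambda: np.exp(np.ascontiguousarray(strided)),
                       number=20)

print(f"contiguous:      {t_contig:.4f} s")
print(f"copy + evaluate: {t_copy:.4f} s")
```

With plain NumPy the function evaluation dominates, so the gap is modest; against a 4-10 cycles/element optimized library the relative copy cost would be much larger.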
Figuring out how best to do this to get the best performance, while staying flexible in which implementation is used, is part of the challenge the student will face in this project.

_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion