[Numpy-discussion] Introductory mail and GSoc Project "Vector math library integration"

Gregor Thalhammer gregor.thalhammer at gmail.com
Tue Mar 24 05:32:04 EDT 2015

> Am 12.03.2015 um 13:48 schrieb Julian Taylor <jtaylor.debian at googlemail.com>:
> On 03/12/2015 10:15 AM, Gregor Thalhammer wrote:
>> Another note, numpy makes it easy to provide new ufuncs, see 
>> http://docs.scipy.org/doc/numpy-dev/user/c-info.ufunc-tutorial.html
>> from a C function that operates on 1D arrays, but this function needs to
>> support arbitrary spacing (stride) between the items. Unfortunately, to
>> achieve good performance, vector math libraries often expect that the
>> items are laid out contiguously in memory. MKL/VML is a notable
>> exception. So for non contiguous in- or output arrays you might need to
>> copy the data to a buffer, which likely kills large amounts of the
>> performance gain.
> The elementary functions are very slow even compared to memory access,
> they take in the orders of hundreds to tens of thousand cycles to
> complete (depending on range and required accuracy).
> Even in the case of strided access that gives the hardware prefetchers
> plenty of time to load the data before the previous computation is done.

That might apply to the mathematical functions from the standard libraries, but it is not true for the optimized vector math libraries, where typical numbers are 4-10 CPU cycles per operation; see e.g. the benchmarks linked below.

The benchmarks at https://github.com/geggo/uvml show that for large arrays the performance of calculating exp is limited by access to main memory. These tests were done quite some time ago; memory bandwidth has typically increased since then, but so has computational power.
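A minimal sketch (not from the original post) of how one might observe this memory-bandwidth effect with plain NumPy: compare exp throughput on a small, cache-resident array against a large array that has to stream from main memory. The array sizes and the use of np.exp as the test kernel are assumptions for illustration.

```python
# Sketch: measure np.exp throughput for cache-resident vs. memory-bound arrays.
# Sizes are illustrative assumptions, not from the original benchmark.
import numpy as np
import timeit

def throughput(n, repeats=10):
    """Return millions of exp evaluations per second for an array of n doubles."""
    x = np.random.rand(n)
    out = np.empty_like(x)
    # Take the best of several runs to reduce timing noise.
    t = min(timeit.repeat(lambda: np.exp(x, out=out), number=1, repeat=repeats))
    return n / t / 1e6

small = throughput(10_000)       # ~80 KB, fits comfortably in cache
large = throughput(10_000_000)   # ~80 MB, must stream from main memory
print(f"small array: {small:.0f} Mops/s, large array: {large:.0f} Mops/s")
```

On hardware where the optimized exp is fast enough, the large-array figure tends to flatten out at the memory bandwidth limit rather than the arithmetic limit.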

> This also removes the requirement from the library to provide a strided
> api, we can copy the strided data into a contiguous buffer and pass it
> to the library without losing much performance. It may not be optimal
> (e.g. a library can fine tune the prefetching better for the case where
> the hardware is not ideal) but most likely sufficient.

Copying the data into a buffer small enough to fit into cache adds only a few cycles per item, but since the optimized functions themselves take only a handful of cycles per operation, even that overhead can impact performance significantly. I am curious to see how much.
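The buffering strategy under discussion can be sketched as follows. This is a hypothetical illustration, not code from the project: strided input is processed in fixed-size blocks, each block copied into a small contiguous scratch buffer before the "vector math" kernel (here stand-in np.exp) is applied. The block size and function names are assumptions.

```python
# Sketch of blocked processing: copy strided data into a contiguous
# cache-sized buffer so a contiguous-only library kernel can be used.
import numpy as np

def apply_buffered(kernel, x, block=4096):
    """Apply `kernel` to a possibly-strided 1D array via a contiguous buffer."""
    out = np.empty(x.shape[0])
    buf = np.empty(block)                 # scratch buffer, ~32 KB of doubles
    for start in range(0, x.shape[0], block):
        stop = min(start + block, x.shape[0])
        n = stop - start
        buf[:n] = x[start:stop]           # strided -> contiguous copy
        out[start:stop] = kernel(buf[:n])  # kernel only ever sees contiguous data
    return out

# Usage: every 3rd element of a larger array, i.e. a genuinely strided view.
data = np.random.rand(30_000)[::3]
result = apply_buffered(np.exp, data)
```

Timing `apply_buffered(np.exp, data)` against `np.exp(data)` directly would quantify exactly the copy overhead asked about above.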


> Figuring out how to best do it to get the best performance and still
> being flexible in what implementation is used is part of the challenge
> the student will face for this project.

