[Numpy-discussion] Introductory mail and GSoc Project "Vector math library integration"

Gregor Thalhammer gregor.thalhammer at gmail.com
Tue Mar 24 05:32:04 EDT 2015


> Am 12.03.2015 um 13:48 schrieb Julian Taylor <jtaylor.debian at googlemail.com>:
> 
> On 03/12/2015 10:15 AM, Gregor Thalhammer wrote:
>> 
>> Another note: numpy makes it easy to provide new ufuncs from a C
>> function that operates on 1D arrays, see
>> http://docs.scipy.org/doc/numpy-dev/user/c-info.ufunc-tutorial.html
>> but this function needs to support arbitrary spacing (stride) between
>> the items. Unfortunately, to achieve good performance, vector math
>> libraries often expect the items to be laid out contiguously in
>> memory; MKL/VML is a notable exception. So for non-contiguous in- or
>> output arrays you might need to copy the data to a buffer, which
>> likely kills a large part of the performance gain.
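To make the contiguity issue concrete, here is a minimal Python sketch of the fallback being discussed. `exp_contig_only` is a hypothetical stand-in for a vector-library routine that only accepts contiguous input (it is not a real library binding); the wrapper meets the ufunc requirement of arbitrary strides by copying first:

```python
import numpy as np

def exp_contig_only(x):
    """Stand-in for a vector-math-library routine that only accepts
    contiguous input (illustrative; not a real library binding)."""
    assert x.flags['C_CONTIGUOUS']
    return np.exp(x)

def exp_any_stride(x):
    """Wrapper that accepts arbitrarily strided input, as a ufunc inner
    loop must: strided data is copied to a contiguous buffer first,
    which costs memory bandwidth and may eat a large part of the
    library's speed advantage."""
    if not x.flags['C_CONTIGUOUS']:
        x = np.ascontiguousarray(x)  # the copy discussed above
    return exp_contig_only(x)
```

The copy in `np.ascontiguousarray` is exactly the overhead the rest of the thread is weighing against the speed of the library call.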
> 
> The elementary functions are very slow even compared to memory access:
> they take on the order of hundreds to tens of thousands of cycles to
> complete (depending on argument range and required accuracy).
> Even in the case of strided access, that gives the hardware prefetchers
> plenty of time to load the data before the previous computation is done.
> 

That might apply to the mathematical functions from the standard libraries, but it is not true for the optimized libraries. Typical numbers are 4-10 CPU cycles per operation, see e.g. 
https://software.intel.com/sites/products/documentation/doclib/mkl_sa/112/vml/functions/_performanceall.html

The benchmarks at https://github.com/geggo/uvml show that for large array sizes the performance of calculating exp is limited by access to main memory. This test was done quite some time ago; memory bandwidth is typically higher now, but so is computational power.
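A rough way to check whether the operation is memory-bound on a given machine is to measure throughput for an in-cache size versus an out-of-cache size. The sketch below uses plain `np.exp` as the workload (the uvml benchmarks wrap MKL/VML instead); the function name and block sizes are illustrative:

```python
import numpy as np
import time

def throughput_exp(n, repeats=10):
    """Rough measure of np.exp throughput in elements per second for an
    array of n float64 values. For arrays much larger than the last-level
    cache, the number is expected to be limited by main-memory bandwidth
    rather than by compute, per the observation above."""
    x = np.random.random(n)
    out = np.empty_like(x)
    t0 = time.perf_counter()
    for _ in range(repeats):
        np.exp(x, out=out)  # out= avoids allocating a fresh array each pass
    dt = (time.perf_counter() - t0) / repeats
    return n / dt

# Compare e.g. an in-cache size with an out-of-cache size:
# throughput_exp(10_000) vs. throughput_exp(10_000_000)
```

If the large-array figure is close to the machine's streaming bandwidth divided by 16 bytes (one float64 read plus one write per element), the computation is memory-bound.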


> This also removes the requirement from the library to provide a strided
> api, we can copy the strided data into a contiguous buffer and pass it
> to the library without losing much performance. It may not be optimal
> (e.g. a library can fine tune the prefetching better for the case where
> the hardware is not ideal) but most likely sufficient.

Copying the data to a buffer small enough to fit into cache might add only a few cycles per item, but since the optimized operations themselves take only a handful of cycles, even this could impact performance significantly. Curious to see how much.
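The blocked-buffer approach Julian describes can be sketched as follows. This is a Python model of what would be done in C inside the ufunc inner loop; the block size is an assumption (chosen so the scratch buffer stays cache-resident), and `np.exp` again stands in for a contiguous-only library call:

```python
import numpy as np

def exp_blocked(x, block=8192):
    """Gather a (possibly strided) 1-D array block-by-block into a small
    contiguous buffer that fits in cache, apply the contiguous-only
    routine to the buffer, and write the results out. The per-block
    copies are the overhead being questioned above."""
    assert x.ndim == 1
    out = np.empty(x.size, dtype=x.dtype)
    buf = np.empty(block, dtype=x.dtype)   # cache-sized contiguous scratch
    for start in range(0, x.size, block):
        m = min(block, x.size - start)
        buf[:m] = x[start:start + m]       # strided gather into the buffer
        out[start:start + m] = np.exp(buf[:m])  # contiguous call
    return out
```

Because the buffer is reused across blocks and stays hot in cache, only the strided reads of `x` and the writes of `out` touch main memory; whether the remaining copy cost is "a few cycles" or more is exactly what would need measuring.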

Gregor

> 
> Figuring out how to best do it to get the best performance and still
> being flexible in what implementation is used is part of the challenge
> the student will face for this project.
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion


