[Numpy-discussion] automatically avoiding temporary arrays

Sat Feb 25 05:17:55 EST 2017

hi,
This PR has now been merged. As it has the potential to break stuff
please test your applications, in particular ones that use cython and
C-extensions against master.

It will only do something when working on large arrays (> 256 KiB) on
platforms providing the backtrace() function. So linux and mac but
windows probably not (though windows could probably be added if someone
volunteers to write the equivalent winapi code).

If you compile from source and want to see for sure if it is doing
anything you can set the NPY_ELIDE_DEBUG define in
./numpy/core/src/multiarray/temp_elide.c to 1 or 2 and it will print out
when it is eliding something.

cheers,
Julian

On 30.09.2016 15:38, Julian Taylor wrote:
> hi,
> Temporary arrays generated in expressions are expensive as the imply
> extra memory bandwidth which is the bottleneck in most numpy operations.
> For example:
> 
> r = a + b + c
> 
> creates the b + c temporary and then adds a to it.
> This can be rewritten to be more efficient using inplace operations:
> 
> r = b + c
> r += a
> 
> This saves some memory bandwidth and can speedup the operation by 50%
> for very large arrays or even more if the inplace operation allows it to
> be completed completely in the cpu cache.
> 
> The problem is that inplace operations are a lot less readable so they
> are often only used in well optimized code. But due to pythons
> refcounting semantics we can actually do some inplace conversions
> transparently.
> If an operand in python has a reference count of one it must be a
> temporary so we can use it as the destination array. CPython itself does
> this optimization for string concatenations.
> 
> In numpy we have the issue that we can be called from the C-API directly
> where the reference count may be one for other reasons.
> To solve this we can check the backtrace until the python frame
> evaluation function. If there are only numpy and python functions in
> between that and our entry point we should be able to elide the temporary.
> 
> This PR implements this:
> https://github.com/numpy/numpy/pull/7997
> 
> It currently only supports Linux with glibc (which has reliable
> backtraces via unwinding) and maybe MacOS depending on how good their
> backtrace is. On windows the backtrace APIs are different and I don't
> know them but in theory it could also be done there.
> 
> A problem is that checking the backtrace is quite expensive, so should
> only be enabled when the involved arrays are large enough for it to be
> worthwhile. In my testing this seems to be around 180-300KiB sized
> arrays, basically where they start spilling out of the CPU L2 cache.
> 
> I made a little crappy benchmark script to test this cutoff in this branch:
> https://github.com/juliantaylor/numpy/tree/elide-bench
> 
> If you are interested you can run it with:
> python setup.py build_ext -j 4 --inplace
> ipython --profile=null check.ipy
> 
> At the end it will plot the ratio between elided and non-elided runtime.
> It should get larger than one around 180KiB on most cpus.
> 
> If no one points out some flaw in the approach, I'm hoping to get this
> into the next numpy version.
> 
> cheers,
> Julian
>