[Numpy-discussion] automatically avoiding temporary arrays
faltet at gmail.com
Wed Oct 5 05:46:21 EDT 2016
2016-10-05 8:45 GMT+02:00 srean <srean.list at gmail.com>:
> Good discussion, but was surprised by the absence of numexpr in the
> discussion, given how relevant it (numexpr) is to the topic.
> Is the goal to fold in the numexpr functionality (and beyond) into Numpy ?
Yes, the question of merging numexpr into numpy is something that
periodically shows up on this list. I think almost everyone agrees that it
is a good idea, but things are not so easy, and so far nobody has provided a
good patch for this. Also, the fact that numexpr relies on passing a whole
expression as a string (e.g. y = ne.evaluate("x**3 + tanh(x**2) + 4")) does
not play well with the way numpy evaluates expressions, so something
would have to be proposed to cope with this too.
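To make the contrast concrete, here is a minimal sketch of the same expression written both ways. The array `x` and its size are made up for illustration, numexpr is treated as an optional dependency, and the variable is passed explicitly through `local_dict` rather than picked up from the calling frame.

```python
import numpy as np

try:
    import numexpr as ne  # optional dependency; may not be installed
except ImportError:
    ne = None

# Illustrative input only; the size is an arbitrary choice
x = np.linspace(0.0, 1.0, 1_000_000)

# Plain numpy: every sub-expression is evaluated eagerly into its own
# full-size temporary array before the next operator runs
y_np = x**3 + np.tanh(x**2) + 4

if ne is not None:
    # numexpr: the whole expression arrives as one string, so it can be
    # compiled and evaluated block by block without full-size temporaries
    y_ne = ne.evaluate("x**3 + tanh(x**2) + 4", local_dict={"x": x})
    print(np.allclose(y_np, y_ne))
```

The string is what lets numexpr see the complete expression at once; numpy only ever sees one binary operation at a time, which is why the two models are hard to reconcile.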
> On Fri, Sep 30, 2016 at 7:08 PM, Julian Taylor <
> jtaylor.debian at googlemail.com> wrote:
>> Temporary arrays generated in expressions are expensive as they imply
>> extra memory bandwidth, which is the bottleneck in most numpy operations.
>> For example:
>> r = a + b + c
>> creates the a + b temporary and then adds c to it, allocating a second array.
>> This can be rewritten to be more efficient using inplace operations:
>> r = b + c
>> r += a
>> This saves some memory bandwidth and can speed up the operation by 50%
>> for very large arrays, or even more if the inplace operation allows it to
>> be completed entirely in the CPU cache.
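For anyone who wants to try this, a small sketch of the readable form next to two manual in-place spellings. The array names follow the example above; the size and random contents are arbitrary assumptions. Python evaluates `a + b + c` left to right, so the sketch reuses the `a + b` result as the destination.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000  # 8 MB per float64 array; arbitrary size for illustration
a, b, c = (rng.random(n) for _ in range(3))

# Readable form: allocates a temporary for a + b, then a second array
# for the final result
r_readable = a + b + c

# In-place form: the first result doubles as the destination of the second add
r_inplace = a + b
r_inplace += c

# Same idea spelled with the ufunc `out=` argument
r_out = np.add(a, b)
np.add(r_out, c, out=r_out)

print(np.array_equal(r_readable, r_inplace), np.array_equal(r_readable, r_out))
```

All three variants perform the same two additions in the same order, so the results agree exactly; only the number of allocated arrays differs.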
>> The problem is that inplace operations are a lot less readable, so they
>> are often only used in well optimized code. But due to Python's
>> refcounting semantics we can actually do some inplace conversions.
>> If an operand in Python has a reference count of one, it must be a
>> temporary, so we can use it as the destination array. CPython itself does
>> this optimization for string concatenations.
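The reference-count argument can be observed from Python with `sys.getrefcount`. This is only a sketch: `Payload` is a hypothetical stand-in for an array object, and the absolute numbers printed are CPython implementation details that vary between versions, so only the difference between the two named counts is meaningful.

```python
import sys


class Payload:
    """Hypothetical stand-in for an array object, for illustration only."""


named = Payload()
count_named = sys.getrefcount(named)

# Binding a second name adds exactly one more reference to the same object
alias = named
count_aliased = sys.getrefcount(named)

print(count_aliased - count_named)  # the extra reference held by `alias`

# An unnamed temporary is referenced only by the interpreter while the call
# runs, so its count is typically lower than that of a named object
print(sys.getrefcount(Payload()))
```

An object that something else still refers to can never show the minimal count, which is what makes a count of one usable as a "this is a temporary" signal inside the interpreter.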
>> In numpy we have the issue that we can be called from the C-API directly,
>> where the reference count may be one for other reasons.
>> To solve this we can walk the backtrace up to the Python frame
>> evaluation function. If there are only numpy and Python functions in
>> between that and our entry point, we should be able to elide the temporary.
>> This PR implements this:
>> It currently only supports Linux with glibc (which has reliable
>> backtraces via unwinding) and maybe macOS, depending on how good its
>> backtrace is. On Windows the backtrace APIs are different and I don't
>> know them, but in theory it could also be done there.
>> A problem is that checking the backtrace is quite expensive, so it should
>> only be enabled when the involved arrays are large enough for it to be
>> worthwhile. In my testing this seems to be around 180-300 KiB sized
>> arrays, basically where they start spilling out of the CPU L2 cache.
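The size gate described above could look roughly like the following in Python terms. This is not code from the PR: the helper name `large_enough_to_check` and the 256 KiB constant are hypothetical, with the constant simply picked from inside the 180-300 KiB range quoted above.

```python
import numpy as np

# Hypothetical cutoff, chosen from within the 180-300 KiB range mentioned above
ELIDE_CUTOFF_BYTES = 256 * 1024


def large_enough_to_check(arr, cutoff=ELIDE_CUTOFF_BYTES):
    """Return True when the array is big enough that an expensive
    backtrace check could plausibly pay for itself."""
    return arr.nbytes >= cutoff


small = np.zeros(1_000)    # 8,000 bytes of float64
large = np.zeros(100_000)  # 800,000 bytes of float64

print(large_enough_to_check(small), large_enough_to_check(large))
```

Gating on `nbytes` rather than on the element count keeps the decision tied to cache footprint, which is what the cutoff is trying to approximate.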
>> I made a little crappy benchmark script to test this cutoff in this
>> If you are interested you can run it with:
>> python setup.py build_ext -j 4 --inplace
>> ipython --profile=null check.ipy
>> At the end it will plot the ratio between elided and non-elided runtime.
>> It should get larger than one around 180 KiB on most CPUs.
>> If no one points out some flaw in the approach, I'm hoping to get this
>> into the next numpy version.
>> NumPy-Discussion mailing list
>> NumPy-Discussion at scipy.org