2016-10-05 8:45 GMT+02:00 srean :
Good discussion, but I was surprised by the absence of numexpr, given how relevant it is to the topic.

Is the goal to fold the numexpr functionality (and beyond) into NumPy?

Yes, the question of merging numexpr into numpy shows up on this list periodically.  I think mostly everyone agrees that it is a good idea, but things are not so easy, and so far nobody has provided a good patch for this.  Also, the fact that numexpr relies on grouping an expression in a string (e.g. y = ne.evaluate("x**3 + tanh(x**2) + 4")) does not play well with the way in which numpy evaluates expressions, so something would have to be suggested to cope with this too.
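To illustrate the difference between the two styles, a minimal sketch (the numexpr call is shown commented out so the example only requires NumPy):

```python
import numpy as np

x = np.linspace(0.0, 1.0, 1_000_000)

# Plain NumPy: every operator materializes a full-size temporary
# (x**2, tanh(x**2), x**3, ...), so memory bandwidth dominates.
y = x**3 + np.tanh(x**2) + 4

# numexpr parses the string into a small internal program and
# evaluates it block-by-block in cache, without large temporaries:
#   import numexpr as ne
#   y = ne.evaluate("x**3 + tanh(x**2) + 4")
```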

On Fri, Sep 30, 2016 at 7:08 PM, Julian Taylor wrote:
hi,
Temporary arrays generated in expressions are expensive as they imply
extra memory bandwidth, which is the bottleneck in most numpy operations.
For example:

r = a + b + c

creates the b + c temporary and then adds a to it.
This can be rewritten to be more efficient using inplace operations:

r = b + c
r += a

This saves some memory bandwidth and can speed up the operation by 50%
for very large arrays, or even more if the inplace operation allows it
to complete entirely in the CPU cache.
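The two forms are easy to check for equivalence; np.add with an explicit out= argument is yet another way to spell the inplace version:

```python
import numpy as np

a = np.random.rand(100_000)
b = np.random.rand(100_000)
c = np.random.rand(100_000)

# Naive form: b + c allocates a temporary, then a is added to it.
r1 = a + b + c

# Inplace form: one allocation, then accumulate into it.
r2 = b + c
r2 += a

# Equivalent spelling with explicit output arrays.
r3 = np.empty_like(a)
np.add(b, c, out=r3)
np.add(r3, a, out=r3)
```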

The problem is that inplace operations are a lot less readable, so they
are often only used in well-optimized code. But due to Python's
refcounting semantics we can actually do some of these inplace
conversions transparently.
If an operand in Python has a reference count of one it must be a
temporary, so we can use it as the destination array. CPython itself
does this optimization for string concatenation.
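As a rough illustration of the refcount signal (sys.getrefcount reports the count as seen from Python, including its own argument reference):

```python
import sys
import numpy as np

a = np.arange(10)

# A named array is referenced at least by the variable binding and by
# the argument slot of getrefcount itself, so its count is >= 2; such
# an array must never be reused as a destination.
named_count = sys.getrefcount(a)

# By contrast, the temporary produced by (a + a) is referenced only
# from the interpreter's evaluation stack while the next ufunc runs;
# seen from C inside that ufunc its count is 1, which is the signal
# the PR uses (after also checking the backtrace) to elide it.
```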

In numpy we have the issue that we can be called directly from the
C-API, where the reference count may be one for other reasons.
To solve this we can check the backtrace up to the Python frame
evaluation function. If there are only numpy and Python functions
between that and our entry point, we should be able to elide the
temporary.

This PR implements this:
https://github.com/numpy/numpy/pull/7997

It currently only supports Linux with glibc (which has reliable
backtraces via unwinding) and maybe MacOS, depending on how good its
backtrace is. On Windows the backtrace APIs are different and I don't
know them, but in theory it could be done there too.

A problem is that checking the backtrace is quite expensive, so it
should only be enabled when the involved arrays are large enough for it
to be worthwhile. In my testing this cutoff seems to be around arrays
of 180-300 KiB, basically where they start spilling out of the CPU L2
cache.
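In element terms (assuming float64 at 8 bytes per element; the 180-300 KiB figures are from the measurements above, the rest is just unit conversion):

```python
# 180-300 KiB expressed as float64 element counts.
low = 180 * 1024 // 8     # 23040 elements
high = 300 * 1024 // 8    # 38400 elements
```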

I made a little crappy benchmark script to test this cutoff in this branch:
https://github.com/juliantaylor/numpy/tree/elide-bench

If you are interested you can run it with:
python setup.py build_ext -j 4 --inplace
ipython --profile=null check.ipy

At the end it will plot the ratio between elided and non-elided runtime.
It should get larger than one at around 180 KiB on most CPUs.

If no one points out some flaw in the approach, I'm hoping to get this
into the next numpy version.

cheers,
Julian

_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion


--
Francesc Alted