<div dir="ltr">2016-10-05 8:45 GMT+02:00 srean <span dir="ltr"><<a href="mailto:srean.list@gmail.com" target="_blank">srean.list@gmail.com</a>></span>:<br><div class="gmail_extra"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div>Good discussion, but was surprised by the absence of numexpr in the discussion, given how relevant it (numexpr) is to the topic.</div><div><br></div><div>Is the goal to fold the numexpr functionality (and beyond) into NumPy?</div></div></blockquote><div><br></div><div>Yes, the question of merging numexpr into numpy comes up periodically on this list.  I think almost everyone agrees that it is a good idea, but things are not so easy, and so far nobody has provided a good patch for this.  Also, the fact that numexpr relies on grouping an expression by using a string (e.g. y = ne.evaluate("x**3 + tanh(x**2) + 4")) does not play well with the way numpy evaluates expressions, so something would have to be proposed to cope with this too.<br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="gmail_extra"><br><div class="gmail_quote"><div><div class="h5">On Fri, Sep 30, 2016 at 7:08 PM, Julian Taylor <span dir="ltr"><<a href="mailto:jtaylor.debian@googlemail.com" target="_blank">jtaylor.debian@googlemail.com</a><wbr>></span> wrote:<br></div></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div><div class="h5">hi,<br>

Temporary arrays generated in expressions are expensive as they imply<br>

extra memory bandwidth which is the bottleneck in most numpy operations.<br>

For example:<br>

<br>

r = a + b + c<br>

<br>

creates the a + b temporary and then adds c to it.<br>

This can be rewritten to be more efficient using in-place operations:<br>

<br>

r = a + b<br>

r += c<br>

<br>

This saves some memory bandwidth and can speed up the operation by 50%<br>

for very large arrays, or even more if the in-place operation allows it to<br>

be completed entirely in the CPU cache.<br>
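<br>
As a minimal sketch of the rewrite above, the two forms can be compared directly. The function names add3_naive and add3_inplace are made up for this illustration and are not part of NumPy:<br>

```python
import numpy as np

def add3_naive(a, b, c):
    # Python evaluates this left to right as (a + b) + c, so a full-size
    # temporary holds the intermediate sum before the result is allocated.
    return a + b + c

def add3_inplace(a, b, c):
    r = a + b   # the only new allocation
    r += c      # accumulates into r's existing buffer
    return r

rng = np.random.default_rng(0)
a, b, c = (rng.random(1_000_000) for _ in range(3))

# Both forms perform the same floating-point additions in the same order.
assert np.array_equal(add3_naive(a, b, c), add3_inplace(a, b, c))
```

The inputs a, b and c are left untouched in both versions; only the freshly allocated r is modified in place.<br>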

<br>

The problem is that in-place operations are a lot less readable, so they<br>

are often only used in well-optimized code. But thanks to Python's<br>

refcounting semantics we can actually do some in-place conversions<br>

transparently.<br>

If an operand in Python has a reference count of one, it must be a<br>

temporary, so we can use it as the destination array. CPython itself does<br>

this optimization for string concatenations.<br>
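<br>
To illustrate the reference counting this relies on, here is a small sketch using sys.getrefcount. Absolute counts vary between CPython versions (the call itself may hold an extra reference), so only the difference is checked:<br>

```python
import sys
import numpy as np

a = np.ones(4)

before = sys.getrefcount(a)
alias = a                      # bind a second name to the same array
after = sys.getrefcount(a)

# Every extra name bound to the array adds one reference. A temporary
# such as the result of (a + a) is bound to no name at all, which is
# what makes it recognizable as reusable.
assert after - before == 1
```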

<br>

In NumPy we have the issue that we can be called from the C-API directly,<br>

where the reference count may be one for other reasons.<br>

To solve this we can walk the backtrace up to the Python frame<br>

evaluation function. If there are only NumPy and Python functions in<br>

between that and our entry point, we should be able to elide the temporary.<br>
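<br>
The check described above can be modelled in pure Python. This is only a toy sketch, not the code in the PR: the backtrace is represented as a hypothetical list of (symbol, library) pairs, innermost frame first, and can_elide, TRUSTED_LIBRARIES and FRAME_EVAL_SYMBOLS are names invented for the illustration:<br>

```python
# Hypothetical model of a native backtrace: (symbol, library) pairs,
# innermost frame first. Nothing here is NumPy API.
FRAME_EVAL_SYMBOLS = {"PyEval_EvalFrameEx", "_PyEval_EvalFrameDefault"}
TRUSTED_LIBRARIES = {"numpy", "python"}

def can_elide(backtrace):
    """Return True if only trusted frames sit between the entry point
    and the Python frame evaluation function."""
    for symbol, library in backtrace:
        if library not in TRUSTED_LIBRARIES:
            # Some other C code called us; a refcount of one may not
            # mean the operand is a temporary.
            return False
        if symbol in FRAME_EVAL_SYMBOLS:
            # Reached the interpreter without meeting foreign code.
            return True
    # Never reached the interpreter: stay conservative.
    return False

from_python = [
    ("array_add", "numpy"),
    ("PyNumber_Add", "python"),
    ("_PyEval_EvalFrameDefault", "python"),
]
from_extension = [
    ("array_add", "numpy"),
    ("my_ext_func", "myext"),  # hypothetical third-party C extension
    ("_PyEval_EvalFrameDefault", "python"),
]
assert can_elide(from_python)
assert not can_elide(from_extension)
```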

<br>

This PR implements this:<br>

<a href="https://github.com/numpy/numpy/pull/7997" rel="noreferrer" target="_blank">https://github.com/numpy/numpy<wbr>/pull/7997</a><br>

<br>

It currently only supports Linux with glibc (which has reliable<br>

backtraces via unwinding) and maybe macOS, depending on how good their<br>

backtrace is. On Windows the backtrace APIs are different and I don't<br>

know them, but in theory it could also be done there.<br>

<br>

A problem is that checking the backtrace is quite expensive, so it should<br>

only be enabled when the involved arrays are large enough for it to be<br>

worthwhile. In my testing this seems to be around 180-300KiB sized<br>

arrays, basically where they start spilling out of the CPU L2 cache.<br>
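<br>
For a sense of scale, the sizes above can be converted into element counts. The 256 KiB cutoff in large_enough is an assumption picked from inside the 180-300KiB range for illustration, not a value taken from the PR:<br>

```python
import numpy as np

KIB = 1024
ITEMSIZE = np.dtype(np.float64).itemsize  # 8 bytes per element

def elements_for(kib, itemsize=ITEMSIZE):
    # Number of elements that fit in `kib` KiB.
    return kib * KIB // itemsize

low = elements_for(180)    # 23040 float64 elements
high = elements_for(300)   # 38400 float64 elements

def large_enough(arr, cutoff_bytes=256 * KIB):
    # Hypothetical gate: only pay for the backtrace check on big arrays.
    return arr.nbytes >= cutoff_bytes

assert large_enough(np.zeros(40_000))     # 320000 bytes
assert not large_enough(np.zeros(1_000))  # 8000 bytes
```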

<br>

I made a little crappy benchmark script to test this cutoff in this branch:<br>

<a href="https://github.com/juliantaylor/numpy/tree/elide-bench" rel="noreferrer" target="_blank">https://github.com/juliantaylo<wbr>r/numpy/tree/elide-bench</a><br>

<br>

If you are interested you can run it with:<br>

python setup.py build_ext -j 4 --inplace<br>

ipython --profile=null check.ipy<br>

<br>

At the end it will plot the ratio between elided and non-elided runtime.<br>

It should get larger than one around 180KiB on most CPUs.<br>

<br>

If no one points out some flaw in the approach, I'm hoping to get this<br>

into the next numpy version.<br>

<br>

cheers,<br>

Julian<br>

<br>

<br></div></div><span class="">______________________________<wbr>_________________<br>

NumPy-Discussion mailing list<br>

<a href="mailto:NumPy-Discussion@scipy.org" target="_blank">NumPy-Discussion@scipy.org</a><br>

<a href="https://mail.scipy.org/mailman/listinfo/numpy-discussion" rel="noreferrer" target="_blank">https://mail.scipy.org/mailman<wbr>/listinfo/numpy-discussion</a><br>

<br></span></blockquote></div><br></div>


<br></blockquote></div><br><br clear="all"><br>-- <br><div class="gmail_signature" data-smartmail="gmail_signature">Francesc Alted</div>

</div></div>