Julian,

This is really, really cool!

I have been wanting something like this for years (over a decade? wow!), but always thought it would require hacking the interpreter to intercept operations. This is a really inspired idea, and could buy numpy a lot of performance.

I'm afraid I can't say much about the implementation details -- but great work!

-Chris

On Fri, Sep 30, 2016 at 2:50 PM, Julian Taylor <jtaylor.debian@googlemail.com> wrote:
On 30.09.2016 23:09, josef.pktd@gmail.com wrote:
> On Fri, Sep 30, 2016 at 9:38 AM, Julian Taylor
> <jtaylor.debian@googlemail.com> wrote:
>> hi,
>> Temporary arrays generated in expressions are expensive as the imply
>> extra memory bandwidth which is the bottleneck in most numpy operations.
>> For example:
>>
>> r = a + b + c
>>
>> creates the b + c temporary and then adds a to it.
>> This can be rewritten to be more efficient using inplace operations:
>>
>> r = b + c
>> r += a
>
> General question (I wouldn't understand the details even if I looked):
>
> How is this affected by broadcasting and type promotion?
>
> One of the main reasons I don't like to use inplace operations in
> general is that I'm often not sure when type promotion occurs and when
> arrays expand during broadcasting.
>
> For example, b + c is 1-D, a is 2-D, and r has the broadcast shape.
> Another case where I switch away from inplace operations is when b + c
> is int or bool and a is float. Thankfully, we get error messages for
> casting now.
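>
> For concreteness, toy versions of both pitfalls (shapes and dtypes made
> up for illustration):
>
> import numpy as np
>
> a = np.zeros((3, 4))   # 2-D
> b = np.ones(4)         # 1-D
> c = np.ones(4)
> r = a + b + c          # fine: r has the broadcast shape (3, 4)
> t = b + c
> t += a                 # ValueError: the (3, 4) result does not fit
>                        # into the (4,)-shaped t
>
> x = np.arange(4)       # an integer array
> x += 0.5               # TypeError: cannot cast the float64 result
>                        # back to the integer dtype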

The temporary is only avoided when the casting follows the safe rule, so
it should be the same as what you get without inplace operations. E.g. a
float32 temporary + float64 will not be converted to the unsafe float32
+= float64 which a normal inplace operation would allow. But a float64
temporary + float32 is transformed.
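
Roughly, in terms of np.can_cast (a sketch of the rule, not the literal
check in the PR):

import numpy as np

a32 = np.ones(100000, dtype=np.float32)
b64 = np.ones(100000, dtype=np.float64)

# (b64 + b64) creates a float64 temporary; adding a32 only needs the
# safe upcast float32 -> float64, so the temporary can be reused:
r = (b64 + b64) + a32

# (a32 + a32) creates a float32 temporary; reusing it for the float64
# result would need the unsafe downcast float64 -> float32, so a new
# float64 result array is allocated instead:
r = (a32 + a32) + b64

np.can_cast(np.float32, np.float64)   # True  -> elision possible
np.can_cast(np.float64, np.float32)   # False -> temporary kept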

Currently the only broadcasting that will be transformed is temporary +
scalar value; otherwise it only works on matching array sizes. There is
nothing that fundamentally prevents full broadcasting, but it is not
implemented yet in the PR.
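
For example (shapes made up):

import numpy as np

a = np.ones(1 << 20)
b = np.ones(1 << 20)
c = np.ones((8, 1 << 20))

r = (a + b) * 2.5    # temporary + scalar: can be elided
r = (a + b) + c      # temporary broadcast against a 2-D array: works
                     # as usual, but currently without elision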

>
>>
>> This saves some memory bandwidth and can speed up the operation by 50%
>> for very large arrays, or even more if the inplace operation allows it
>> to be completed entirely in the CPU cache.
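>>
>> A quick way to see the effect yourself (sizes arbitrary, numbers will
>> vary by machine):
>>
>> import numpy as np
>> import timeit
>>
>> a, b, c = (np.ones(5 * 1024 * 1024) for _ in range(3))  # ~40 MiB each
>>
>> def with_temporary():
>>     return a + b + c    # allocates the b + c temporary first
>>
>> def explicit_inplace():
>>     r = b + c           # the temporary becomes the result
>>     r += a
>>     return r
>>
>> print(timeit.timeit(with_temporary, number=100))
>> print(timeit.timeit(explicit_inplace, number=100))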
>
> I didn't realize the difference can be so large. That would make
> streamlining some code worth the effort.
>
> Josef
>
>
>>
>> The problem is that inplace operations are a lot less readable, so they
>> are often only used in well-optimized code. But due to Python's
>> refcounting semantics we can actually do some inplace conversions
>> transparently.
>> If an operand in Python has a reference count of one, it must be a
>> temporary, so we can use it as the destination array. CPython itself
>> does this optimization for string concatenation.
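>>
>> The refcount observation is easy to see from Python (CPython-specific;
>> note that getrefcount counts its own argument as well):
>>
>> import sys
>> import numpy as np
>>
>> a = np.ones(10)
>> print(sys.getrefcount(a + a))  # 1: only the call sees the temporary
>> print(sys.getrefcount(a))      # 2: the name 'a' plus the argument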
>>
>> In numpy we have the issue that we can be called from the C-API
>> directly, where the reference count may be one for other reasons.
>> To solve this we can check the backtrace until the Python frame
>> evaluation function. If there are only numpy and Python functions
>> between that and our entry point, we should be able to elide the
>> temporary.
>>
>> This PR implements this:
>> https://github.com/numpy/numpy/pull/7997
>>
>> It currently only supports Linux with glibc (which has reliable
>> backtraces via unwinding) and maybe macOS, depending on how good its
>> backtrace is. On Windows the backtrace APIs are different and I don't
>> know them, but in theory it could also be done there.
>>
>> A problem is that checking the backtrace is quite expensive, so it
>> should only be enabled when the involved arrays are large enough for it
>> to be worthwhile. In my testing this seems to be around 180-300 KiB
>> sized arrays, basically where they start spilling out of the CPU L2
>> cache.
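>>
>> In float64 elements, that cutoff is roughly:
>>
>> 180 * 1024 // 8   # 23040 elements
>> 300 * 1024 // 8   # 38400 elements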
>>
>> I made a little crappy benchmark script to test this cutoff in this branch:
>> https://github.com/juliantaylor/numpy/tree/elide-bench
>>
>> If you are interested you can run it with:
>> python setup.py build_ext -j 4 --inplace
>> ipython --profile=null check.ipy
>>
>> At the end it will plot the ratio between elided and non-elided
>> runtime. It should get larger than one around 180 KiB on most CPUs.
>>
>> If no one points out some flaw in the approach, I'm hoping to get this
>> into the next numpy version.
>>
>> cheers,
>> Julian
>>
>>



_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion

--

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker@noaa.gov