On Mon, Oct 3, 2016 at 3:16 AM, Julian Taylor <jtaylor.debian@googlemail.com> wrote:
the problem with this approach is that we don't really want numpy
hogging on to hundreds of megabytes of memory by default so it would
need to be a user option.

indeed -- but one could set an LRU cache to be very small (limited by item count, not by memory size), so that it gets used within expressions but doesn't hold on to much memory outside of them.
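To make that concrete, here is a rough sketch of what I have in mind -- a tiny LRU cache of scratch arrays keyed by (shape, dtype), bounded by item count. The class name and API are entirely made up for illustration; this is not anything that exists in numpy:

```python
from collections import OrderedDict

import numpy as np


class ArrayCache:
    """Hypothetical LRU cache of scratch arrays, keyed by (shape, dtype).

    Bounded by number of cached arrays, not by total bytes, so it stays
    small by default and mostly helps within a single expression.
    """

    def __init__(self, maxsize=4):
        self.maxsize = maxsize
        self._cache = OrderedDict()

    def get(self, shape, dtype):
        """Return a cached array if one matches, else allocate a fresh one."""
        key = (tuple(shape), np.dtype(dtype))
        if key in self._cache:
            # pop() removes it so the same buffer can't be handed out twice
            return self._cache.pop(key)
        return np.empty(shape, dtype)

    def put(self, arr):
        """Return a scratch array to the cache, evicting the oldest if full."""
        key = (arr.shape, arr.dtype)
        self._cache[key] = arr
        self._cache.move_to_end(key)
        if len(self._cache) > self.maxsize:
            self._cache.popitem(last=False)  # evict least-recently used
```

Usage would look something like: `tmp = cache.get(a.shape, a.dtype); np.add(a, b, out=tmp); ...; cache.put(tmp)` -- the second expression that needs a same-shaped temporary then gets the buffer back without a fresh allocation.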

However, is the allocation the only (or even the biggest) source of the performance hit?
If you generate a temporary as a result of an operation, rather than doing it in-place, that temporary needs to be allocated, but it also means that an additional array needs to be pushed through the processor -- and that can make a big performance difference too.

I'm not entirely sure how to profile this correctly, but the timings below seem to indicate that the allocation is cheap compared to the operations (for a million-element array):

* Regular old temporary creation

In [24]: def f1(arr1, arr2):
    ...:     result = arr1 + arr2
    ...:     return result

In [26]: %timeit f1(arr1, arr2)
1000 loops, best of 3: 1.13 ms per loop

* Completely in-place, no allocation of an extra array

In [27]: def f2(arr1, arr2):
    ...:     arr1 += arr2
    ...:     return arr1

In [28]: %timeit f2(arr1, arr2)
1000 loops, best of 3: 755 µs per loop

So the in-place version is about 30% faster.

* Allocate a temporary that isn't used -- this should still capture the allocation cost

In [29]: def f3(arr1, arr2):
    ...:     result = np.empty_like(arr1)
    ...:     arr1 += arr2
    ...:     return arr1

In [30]: %timeit f3(arr1, arr2)
1000 loops, best of 3: 756 µs per loop

Only about a µs slower than the purely in-place version!

Profiling is hard, and I'm not good at it, but this seems to indicate that the allocation is cheap.
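In case anyone wants to reproduce this outside an IPython session, here is a standalone version of the three timings. The array contents and the timeit repeat counts are my own choices; only the million-element size comes from the session above:

```python
import timeit

import numpy as np

N = 1_000_000  # "million-element array" from the timings above

arr1 = np.ones(N)
arr2 = np.ones(N)


def f1(a, b):
    # Regular old temporary creation: a + b allocates a new result array.
    result = a + b
    return result


def f2(a, b):
    # Completely in-place: no extra array is allocated.
    a += b
    return a


def f3(a, b):
    # Allocate a temporary that isn't used, then work in place --
    # this isolates just the allocation cost on top of f2.
    result = np.empty_like(a)
    a += b
    return a


for f in (f1, f2, f3):
    # Best of 3 runs, 100 calls each, like %timeit's "best of" reporting.
    per_call = min(timeit.repeat(lambda: f(arr1, arr2), number=100, repeat=3)) / 100
    print(f"{f.__name__}: {per_call * 1e6:.0f} µs per call")
```

Note that f2 and f3 mutate arr1 in place, so its values grow across runs -- that doesn't change the timing story, but it's worth knowing if you add correctness checks.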



Christopher Barker, Ph.D.

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception