[Numpy-discussion] understanding buffering done when broadcasting

Wed Nov 25 03:21:25 EST 2015

On Di, 2015-11-24 at 16:49 -0800, Eli Bendersky wrote:
> 
> 
> On Mon, Nov 23, 2015 at 2:09 PM, Sebastian Berg
> <sebastian at sipsolutions.net> wrote:
>         On Mo, 2015-11-23 at 13:31 -0800, Eli Bendersky wrote:
>         > Hello,
>         >
>         >
>         > I'm trying to understand the buffering done by the Numpy
>         iterator
>         > interface (the new post 1.6-one) when running ufuncs on
>         arrays that
>         > require broadcasting. Consider this simple case:
>         >
>         > In [35]: m = np.arange(16).reshape(4,4)
>         > In [37]: n = np.arange(4)
>         > In [39]: m + n
>         > Out[39]:
>         > array([[ 0,  2,  4,  6],
>         >        [ 4,  6,  8, 10],
>         >        [ 8, 10, 12, 14],
>         >        [12, 14, 16, 18]])
>         >
>         >
>         
>         
>         On first sight this seems true. However, there is one other
>         point to
>         consider here. The inner ufunc loop can only handle a single
>         stride. The
>         contiguous array `n` has to be iterated as if it had the
>         strides
>         `(0, 8)`, which is not the strides of the contiguous array `m`
>         which can
>         be "unrolled" to 1-D. Those effective strides are thus not
>         contiguous
>         for the inner ufunc loop and cannot be unrolled into a single
>         ufunc
>         call!
>         
>         The optimization (which might kick in a bit more broadly
>         maybe), is thus
>         that the number of inner loop calls is minimized, whether that
>         is worth
>         it, I am not sure, it may well be that there is some worthy
>         optimization
>         possible here.
>         Note however, that this does not occur for large inner loop
>         sizes
>         (though I think you can find some "bad" sizes):
>         
>         ```
>         In [18]: n = np.arange(40000)
>         
>         In [19]: m = np.arange(160000).reshape(4,40000)
>         
>         In [20]: o = m + n
>         Iterator: Checking casting for operand 0
>         op: dtype('int64'), iter: dtype('int64')
>         Iterator: Checking casting for operand 1
>         op: dtype('int64'), iter: dtype('int64')
>         Iterator: Checking casting for operand 2
>         op: <null>, iter: dtype('int64')
>         Iterator: Setting allocated stride 1 for iterator dimension 0
>         to 8
>         Iterator: Setting allocated stride 0 for iterator dimension 1
>         to 320000
>         Iterator: Copying inputs to buffers
>         Iterator: Expanding inner loop size from 8192 to 40000 since
>         buffering
>         wasn't needed
>         Any buffering needed: 0
>         Iterator: Finished copying inputs to buffers (buffered size is
>         40000)
>         ```
> 
> 
> The heuristic in the code says that if we can use a single stride and
> that's larger than the buffer size (which I assume is the default
> buffer size, and can change) then it's "is_onestride" and no buffering
> is done.
> 
> 
> So this led me to explore around this threshold (8192 items by default
> on my machine), and indeed we can notice funny behavior:
> 
> In [51]: %%timeit n = arange(8192); m =
> np.arange(8192*40).reshape(40,8192)
> o = m + n
>    ....: 
> 1000 loops, best of 3: 274 µs per loop
> 
> In [52]: %%timeit n = arange(8292); m =
> np.arange(8292*40).reshape(40,8292)
> o = m + n
>    ....: 
> 1000 loops, best of 3: 229 µs per loop
> 
> 
> So, given this, it's not very clear why the "optimization" kicks in.
> Buffering for small sizes seems like a mistake.
> 

I am pretty sure it is not generally a mistake. Consider the case of an
10000x3 array (note that shrinking the buffer can have great advantage
though, I am not sure if this is done usually).
If you have (10000, 3) + (3,) arrays, then the ufunc outer loop would
have 10000x overhead. Doing the buffering (which I believe has some
extra code to be faster), will lower this to a few ufunc inner loop
calls.
I have not timed it, but I would be a bit surprised if it was not faster
in this case at least. Even calling a C function (and looping) for an
inner loop of 3 elements, should be quite a bit of overhead, and my
guess is more overhead than the buffering.

- Sebastian

> 
> Eli
> 
> 
>  
>         
>         Anyway, feel free to have a look ;). The code is not the most
>         read one
>         in NumPy, and it would not surprise me a lot if you can find
>         something
>         to tweak.
>         
>         - Sebastian
>         
>         
>         >
>         > If I instrument Numpy (setting NPY_IT_DBG_TRACING and such),
>         I see
>         > that when the add() ufunc is called, 'n' is copied into a
>         temporary
>         > buffer by the iterator. The ufunc then gets the buffer as
>         its data.
>         >
>         >
>         > My question is: why is this buffering needed? It seems
>         wasteful, since
>         > no casting is required here, no special alignment problems
>         and also
>         > 'n' is contiguously laid out in memory. It seems that it
>         would be more
>         > efficient to just use 'n' in the ufunc instead of passing in
>         the
>         > buffer. What am I missing?
>         >
>         >
>         > Thanks in advance,
>         >
>         > Eli
>         >
>         > _______________________________________________
>         > NumPy-Discussion mailing list
>         > NumPy-Discussion at scipy.org
>         > https://mail.scipy.org/mailman/listinfo/numpy-discussion
>         
>         
>         _______________________________________________
>         NumPy-Discussion mailing list
>         NumPy-Discussion at scipy.org
>         https://mail.scipy.org/mailman/listinfo/numpy-discussion
>         
> 
> 
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at scipy.org
> https://mail.scipy.org/mailman/listinfo/numpy-discussion

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 819 bytes
Desc: This is a digitally signed message part
URL: <http://mail.python.org/pipermail/numpy-discussion/attachments/20151125/6fc04c58/attachment.sig>