# [Numpy-discussion] understanding buffering done when broadcasting

Sebastian Berg sebastian at sipsolutions.net
Mon Nov 23 17:09:53 EST 2015

```On Mo, 2015-11-23 at 13:31 -0800, Eli Bendersky wrote:
> Hello,
>
>
> I'm trying to understand the buffering done by the Numpy iterator
> interface (the new post 1.6-one) when running ufuncs on arrays that
> require broadcasting. Consider this simple case:
>
> In [35]: m = np.arange(16).reshape(4,4)
> In [37]: n = np.arange(4)
> In [39]: m + n
> Out[39]:
> array([[ 0,  2,  4,  6],
>        [ 4,  6,  8, 10],
>        [ 8, 10, 12, 14],
>        [12, 14, 16, 18]])
>
>

On first sight this seems true. However, there is one other point to
consider here. The inner ufunc loop can only handle a single stride. The
contiguous array `n` has to be iterated as if it had the strides
`(0, 8)`, which is not the strides of the contiguous array `m` which can
be "unrolled" to 1-D. Those effective strides are thus not contiguous
for the inner ufunc loop and cannot be unrolled into a single ufunc
call!

The optimization (which might kick in a bit more broadly maybe), is thus
that the number of inner loop calls is minimized, whether that is worth
it, I am not sure, it may well be that there is some worthy optimization
possible here.
Note however, that this does not occur for large inner loop sizes
(though I think you can find some "bad" sizes):

```
In [18]: n = np.arange(40000)

In [19]: m = np.arange(160000).reshape(4,40000)

In [20]: o = m + n
Iterator: Checking casting for operand 0
op: dtype('int64'), iter: dtype('int64')
Iterator: Checking casting for operand 1
op: dtype('int64'), iter: dtype('int64')
Iterator: Checking casting for operand 2
op: <null>, iter: dtype('int64')
Iterator: Setting allocated stride 1 for iterator dimension 0 to 8
Iterator: Setting allocated stride 0 for iterator dimension 1 to 320000
Iterator: Copying inputs to buffers
Iterator: Expanding inner loop size from 8192 to 40000 since buffering
wasn't needed
Any buffering needed: 0
Iterator: Finished copying inputs to buffers (buffered size is 40000)
```

Anyway, feel free to have a look ;). The code is not the most read one
in NumPy, and it would not surprise me a lot if you can find something
to tweak.

- Sebastian

>
> If I instrument Numpy (setting NPY_IT_DBG_TRACING and such), I see
> that when the add() ufunc is called, 'n' is copied into a temporary
> buffer by the iterator. The ufunc then gets the buffer as its data.
>
>
> My question is: why is this buffering needed? It seems wasteful, since
> no casting is required here, no special alignment problems and also
> 'n' is contiguously laid out in memory. It seems that it would be more
> efficient to just use 'n' in the ufunc instead of passing in the
> buffer. What am I missing?
>
>
>
> Eli
>
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at scipy.org
> https://mail.scipy.org/mailman/listinfo/numpy-discussion

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 819 bytes
Desc: This is a digitally signed message part
URL: <http://mail.python.org/pipermail/numpy-discussion/attachments/20151123/cd3a8b66/attachment.sig>
```