Understanding buffering done when broadcasting

Hello,

I'm trying to understand the buffering done by the NumPy iterator interface (the new, post-1.6 one) when running ufuncs on arrays that require broadcasting. Consider this simple case:

```
In [35]: m = np.arange(16).reshape(4,4)

In [37]: n = np.arange(4)

In [39]: m + n
Out[39]:
array([[ 0,  2,  4,  6],
       [ 4,  6,  8, 10],
       [ 8, 10, 12, 14],
       [12, 14, 16, 18]])
```

If I instrument NumPy (setting NPY_IT_DBG_TRACING and such), I see that when the add() ufunc is called, 'n' is copied into a temporary buffer by the iterator. The ufunc then gets the buffer as its data.

My question is: why is this buffering needed? It seems wasteful, since no casting is required here, there are no special alignment problems, and 'n' is laid out contiguously in memory. It seems that it would be more efficient to just use 'n' in the ufunc instead of passing in the buffer. What am I missing?

Thanks in advance,
Eli
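(For illustration, a minimal way to see the stride pattern involved; `np.broadcast_to` is only used here to visualize the broadcast view, it is not what the ufunc machinery calls internally.)

```
import numpy as np

m = np.arange(16).reshape(4, 4)
n = np.arange(4)

# View n broadcast against m's shape and inspect the strides the
# iterator effectively has to deal with for the broadcast operand.
bn = np.broadcast_to(n, m.shape)
print(bn.shape)    # (4, 4)
print(bn.strides)  # (0, 8) for int64: stride 0 along the broadcast axis
```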

On Mo, 2015-11-23 at 13:31 -0800, Eli Bendersky wrote:
> [...] It seems that it would be more efficient to just use 'n' in the
> ufunc instead of passing in the buffer. What am I missing?
On first sight this seems true. However, there is one other point to consider here. The inner ufunc loop can only handle a single stride. The contiguous array `n` has to be iterated as if it had the strides `(0, 8)`, which is not the same as the strides of the contiguous array `m`, which can be "unrolled" to 1-D. Those effective strides are thus not contiguous for the inner ufunc loop and cannot be unrolled into a single ufunc call!

The optimization (which might kick in a bit more broadly than necessary) is thus that the number of inner loop calls is minimized. Whether that is worth it, I am not sure; it may well be that there is some worthwhile optimization possible here. Note, however, that this does not occur for large inner loop sizes (though I think you can find some "bad" sizes):

```
In [18]: n = np.arange(40000)

In [19]: m = np.arange(160000).reshape(4,40000)

In [20]: o = m + n
Iterator: Checking casting for operand 0
          op: dtype('int64'), iter: dtype('int64')
Iterator: Checking casting for operand 1
          op: dtype('int64'), iter: dtype('int64')
Iterator: Checking casting for operand 2
          op: <null>, iter: dtype('int64')
Iterator: Setting allocated stride 1 for iterator dimension 0 to 8
Iterator: Setting allocated stride 0 for iterator dimension 1 to 320000
Iterator: Copying inputs to buffers
Iterator: Expanding inner loop size from 8192 to 40000 since buffering wasn't needed
Any buffering needed: 0
Iterator: Finished copying inputs to buffers (buffered size is 40000)
```

Anyway, feel free to have a look ;). That code is not the most widely read part of NumPy, and it would not surprise me a lot if you can find something to tweak.

- Sebastian
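(For illustration, a sketch of the same chunking behaviour through the public `np.nditer` API; this is not the exact code path the ufunc takes, but it exposes the 'external_loop' and 'buffered' flags being discussed.)

```
import numpy as np

m = np.arange(16).reshape(4, 4)
n = np.arange(4)

# 'external_loop' asks the iterator for large 1-D chunks; 'buffered'
# lets it copy the broadcast operand into a temporary buffer so that a
# single stride works across a whole chunk.
it = np.nditer([m, n, None],
               flags=['external_loop', 'buffered'],
               op_flags=[['readonly'], ['readonly'],
                         ['writeonly', 'allocate']])
with it:
    for a, b, out in it:
        print(a.shape, b.shape)  # chunk sizes handed to the "inner loop"
        out[...] = a + b
    result = it.operands[2]
print(result)
```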

On Mon, Nov 23, 2015 at 2:09 PM, Sebastian Berg <sebastian@sipsolutions.net> wrote:
> The optimization (which might kick in a bit more broadly than necessary) is
> thus that the number of inner loop calls is minimized. Whether that is
> worth it, I am not sure; it may well be that there is some worthwhile
> optimization possible here.
The heuristic in the code says that if we can use a single stride and that's larger than the buffer size (which I assume is the default buffer size, and can change), then it's "is_onestride" and no buffering is done.

So this led me to explore around this threshold (8192 items by default on my machine), and indeed we can notice funny behavior:

```
In [51]: %%timeit n = arange(8192); m = np.arange(8192*40).reshape(40,8192)
   ....: o = m + n
   ....:
1000 loops, best of 3: 274 µs per loop

In [52]: %%timeit n = arange(8292); m = np.arange(8292*40).reshape(40,8292)
   ....: o = m + n
   ....:
1000 loops, best of 3: 229 µs per loop
```

So, given this, it's not very clear why the "optimization" kicks in. Buffering for small sizes seems like a mistake.

Eli
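(For reference, a self-contained way to reproduce this comparison outside IPython; the absolute numbers and the 8192-element default buffer size are machine- and build-dependent.)

```
import timeit

# Compare a width right at the default buffer size (8192 elements)
# against one just above it; the broadcast operand 'n' is what may or
# may not get copied into the iterator's temporary buffer.
for width in (8192, 8292):
    setup = ("import numpy as np; "
             "n = np.arange({w}); "
             "m = np.arange({w} * 40).reshape(40, {w})").format(w=width)
    best = min(timeit.repeat("m + n", setup=setup, number=1000, repeat=3))
    print(width, "%.1f us per add" % (best / 1000 * 1e6))
```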

On Di, 2015-11-24 at 16:49 -0800, Eli Bendersky wrote:
> So, given this, it's not very clear why the "optimization" kicks in.
> Buffering for small sizes seems like a mistake.
I am pretty sure it is not generally a mistake. Consider the case of a 10000x3 array (note that shrinking the buffer can have a great advantage, though I am not sure whether that is usually done). If you have (10000, 3) + (3,) arrays, then the ufunc outer loop would incur its per-call overhead 10000 times. Doing the buffering (which I believe has some extra code to make it fast) lowers this to a few ufunc inner loop calls. I have not timed it, but I would be a bit surprised if it were not faster in this case at least. Even calling a C function (and looping) for an inner loop of 3 elements should be quite a bit of overhead, and my guess is more overhead than the buffering.

- Sebastian
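(A quick way to time the shape described here, as a sketch; whether buffering actually wins depends on the NumPy version and hardware.)

```
import timeit

# (10000, 3) + (3,): many outer iterations, tiny inner loop. Buffering
# batches the broadcast (3,) operand so the inner loop is called far
# fewer times than 10000.
setup = ("import numpy as np; "
         "a = np.arange(30000).reshape(10000, 3); "
         "b = np.arange(3)")
best = min(timeit.repeat("a + b", setup=setup, number=1000, repeat=3))
print("%.1f us per add" % (best / 1000 * 1e6))
```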

On Wed, Nov 25, 2015 at 12:21 AM, Sebastian Berg <sebastian@sipsolutions.net> wrote:
> I am pretty sure it is not generally a mistake. [...] If you have
> (10000, 3) + (3,) arrays, then the ufunc outer loop would incur its
> per-call overhead 10000 times. Doing the buffering [...] lowers this to
> a few ufunc inner loop calls.
Yes, that's a good point for arrays shaped like this. I guess all this leaves us with is the realization that the heuristic *could* be tuned somewhat for arrays where the inner dimension is large; as the case I demonstrated above shows, it's nonsensical for a computation to become 20% faster when the array size increases past an arbitrary threshold. I'll see if I can find some time to dig more into this and figure out where the knobs for tweaking the heuristic are.

Thanks for the enlightening discussion, Sebastian.

Eli
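(If I understand the machinery correctly, the user-facing knob for the buffer size itself is `np.getbufsize` / `np.setbufsize`; the iterator's single-stride threshold heuristic lives in the C code. A sketch:)

```
import numpy as np

# The ufunc machinery's buffer size (in elements) can be inspected and
# changed at the Python level; the default of 8192 matches the threshold
# seen in the timings above.
print(np.getbufsize())   # typically 8192
np.setbufsize(2 ** 14)   # takes effect for subsequent ufunc calls
print(np.getbufsize())   # now 16384
```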
Participants (2):
- Eli Bendersky
- Sebastian Berg