[Numpy-discussion] understanding buffering done when broadcasting

Eli Bendersky eliben at gmail.com
Fri Nov 27 11:13:52 EST 2015


On Wed, Nov 25, 2015 at 12:21 AM, Sebastian Berg <sebastian at sipsolutions.net> wrote:

> On Di, 2015-11-24 at 16:49 -0800, Eli Bendersky wrote:
> >
> >
> > On Mon, Nov 23, 2015 at 2:09 PM, Sebastian Berg
> > <sebastian at sipsolutions.net> wrote:
> >         On Mo, 2015-11-23 at 13:31 -0800, Eli Bendersky wrote:
> >         > Hello,
> >         >
> >         >
> >         > I'm trying to understand the buffering done by the NumPy
> >         > iterator interface (the new, post-1.6 one) when running
> >         > ufuncs on arrays that require broadcasting. Consider this
> >         > simple case:
> >         >
> >         > In [35]: m = np.arange(16).reshape(4,4)
> >         > In [37]: n = np.arange(4)
> >         > In [39]: m + n
> >         > Out[39]:
> >         > array([[ 0,  2,  4,  6],
> >         >        [ 4,  6,  8, 10],
> >         >        [ 8, 10, 12, 14],
> >         >        [12, 14, 16, 18]])
> >         >
> >         >
> >
> >
> >         On first sight this seems true. However, there is one other
> >         point to consider here. The inner ufunc loop can only handle a
> >         single stride. The contiguous array `n` has to be iterated as
> >         if it had the strides `(0, 8)`, which do not match the strides
> >         of the contiguous array `m`, which can be "unrolled" to 1-D.
> >         Those effective strides are thus not contiguous for the inner
> >         ufunc loop, and the operation cannot be unrolled into a single
> >         ufunc call!
> >
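The effective strides described above can be checked directly from Python with `np.broadcast_to`; a small sketch (the byte values assume the default 8-byte int64 dtype):

```python
import numpy as np

m = np.arange(16).reshape(4, 4)
n = np.arange(4)  # strides (8,) for 8-byte int64

# Broadcasting n against m yields a view with stride 0 along the
# repeated axis, so the broadcast operand cannot be "unrolled" to 1-D
# the way the contiguous m can.
nb = np.broadcast_to(n, m.shape)
print(m.strides)   # (32, 8) -- contiguous, unrollable to 1-D
print(nb.strides)  # (0, 8)  -- stride 0 along the broadcast axis
```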
> >         The optimization (which might kick in a bit more broadly,
> >         maybe) is thus that the number of inner loop calls is
> >         minimized. Whether that is worth it, I am not sure; it may
> >         well be that there is some worthwhile optimization possible
> >         here. Note, however, that this does not occur for large inner
> >         loop sizes (though I think you can find some "bad" sizes):
> >
> >         ```
> >         In [18]: n = np.arange(40000)
> >
> >         In [19]: m = np.arange(160000).reshape(4,40000)
> >
> >         In [20]: o = m + n
> >         Iterator: Checking casting for operand 0
> >         op: dtype('int64'), iter: dtype('int64')
> >         Iterator: Checking casting for operand 1
> >         op: dtype('int64'), iter: dtype('int64')
> >         Iterator: Checking casting for operand 2
> >         op: <null>, iter: dtype('int64')
> >         Iterator: Setting allocated stride 1 for iterator dimension 0 to 8
> >         Iterator: Setting allocated stride 0 for iterator dimension 1 to 320000
> >         Iterator: Copying inputs to buffers
> >         Iterator: Expanding inner loop size from 8192 to 40000 since buffering wasn't needed
> >         Any buffering needed: 0
> >         Iterator: Finished copying inputs to buffers (buffered size is 40000)
> >         ```
> >
> >
> > The heuristic in the code says that if we can use a single stride and
> > that loop is larger than the buffer size (which I assume is the
> > default buffer size, and can be changed), then it's "is_onestride"
> > and no buffering is done.
> >
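The buffer size this heuristic compares against is exposed at the Python level via `np.getbufsize`/`np.setbufsize` (the default is 8192 elements), so the threshold itself can be moved; a sketch:

```python
import numpy as np

# Query the current ufunc buffer size (8192 elements by default) and
# raise it, which also moves the one-stride/buffering threshold.
old = np.getbufsize()
np.setbufsize(4 * old)

# ... ufunc calls here see the larger buffer ...

np.setbufsize(old)  # restore the previous size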
> >
> > So this led me to explore around this threshold (8192 items by default
> > on my machine), and indeed we can notice funny behavior:
> >
> > In [51]: %%timeit n = arange(8192); m =
> > np.arange(8192*40).reshape(40,8192)
> > o = m + n
> >    ....:
> > 1000 loops, best of 3: 274 µs per loop
> >
> > In [52]: %%timeit n = arange(8292); m =
> > np.arange(8292*40).reshape(40,8292)
> > o = m + n
> >    ....:
> > 1000 loops, best of 3: 229 µs per loop
> >
> >
> > So, given this, it's not very clear why the "optimization" kicks in.
> > Buffering for small sizes seems like a mistake.
> >
>
> I am pretty sure it is not generally a mistake. Consider the case of an
> 10000x3 array (note that shrinking the buffer can have great advantage
> though, I am not sure if this is done usually).
> If you have (10000, 3) + (3,) arrays, then the ufunc outer loop would
> have 10000x overhead. Doing the buffering (which I believe has some
> extra code to be faster), will lower this to a few ufunc inner loop
> calls.
> I have not timed it, but I would be a bit surprised if it was not faster
> in this case at least. Even calling a C function (and looping) for an
> inner loop of 3 elements, should be quite a bit of overhead, and my
> guess is more overhead than the buffering.
>

Yes, that's a good point for arrays shaped like this. I guess all this
leaves us with is a realization that the heuristic *could* be tuned
somewhat for arrays where the inner dimension is large - as in the case I
demonstrated above it's nonsensical to have a computation be 20% faster
when the array size increases over an arbitrary threshold.

I'll see if I can find some time to dig more into this and figure out where
the knobs to tweak the heuristic are.

Thanks for the enlightening discussion, Sebastian

Eli










> - Sebastian
>
>
> >
> > Eli
> >
> >
> >
> >
> >         Anyway, feel free to have a look ;). The code is not the most
> >         read one
> >         in NumPy, and it would not surprise me a lot if you can find
> >         something
> >         to tweak.
> >
> >         - Sebastian
> >
> >
> >         >
> >         > If I instrument Numpy (setting NPY_IT_DBG_TRACING and such),
> >         I see
> >         > that when the add() ufunc is called, 'n' is copied into a
> >         temporary
> >         > buffer by the iterator. The ufunc then gets the buffer as
> >         its data.
> >         >
> >         >
> >         > My question is: why is this buffering needed? It seems
> >         wasteful, since
> >         > no casting is required here, no special alignment problems
> >         and also
> >         > 'n' is contiguously laid out in memory. It seems that it
> >         would be more
> >         > efficient to just use 'n' in the ufunc instead of passing in
> >         the
> >         > buffer. What am I missing?
> >         >
> >         >
> >         > Thanks in advance,
> >         >
> >         > Eli
> >         >
> >         > _______________________________________________
> >         > NumPy-Discussion mailing list
> >         > NumPy-Discussion at scipy.org
> >         > https://mail.scipy.org/mailman/listinfo/numpy-discussion
> >
> >
> >         _______________________________________________
> >         NumPy-Discussion mailing list
> >         NumPy-Discussion at scipy.org
> >         https://mail.scipy.org/mailman/listinfo/numpy-discussion
> >
> >
> >
> > _______________________________________________
> > NumPy-Discussion mailing list
> > NumPy-Discussion at scipy.org
> > https://mail.scipy.org/mailman/listinfo/numpy-discussion
>
>
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at scipy.org
> https://mail.scipy.org/mailman/listinfo/numpy-discussion
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/numpy-discussion/attachments/20151127/b6e15ba2/attachment.html>


More information about the NumPy-Discussion mailing list