[Numpy-discussion] Speedup by avoiding memory alloc twice in scalar array

Frédéric Bastien nouiz at nouiz.org
Wed Jul 17 12:57:16 EDT 2013


On Wed, Jul 17, 2013 at 10:39 AM, Nathaniel Smith <njs at pobox.com> wrote:

> On Tue, Jul 16, 2013 at 7:53 PM, Frédéric Bastien <nouiz at nouiz.org> wrote:
> > Hi,
> >
> >
> > On Tue, Jul 16, 2013 at 11:55 AM, Nathaniel Smith <njs at pobox.com> wrote:
> >>
> >> On Tue, Jul 16, 2013 at 2:34 PM, Arink Verma <arinkverma at gmail.com> wrote:
> >>>
> >>> >Each ndarray does two mallocs, for the obj and buffer. These could be
> >>> > combined into 1 - just allocate the total size and do some pointer
> >>> > arithmetic, then set OWNDATA to false.
> >>> So, those two mallocs have been mentioned in the project
> >>> introduction. I got that wrong.
> >>
> >>
> >> On further thought/reading the code, it appears to be more complicated
> >> than that, actually.
> >>
> >> It looks like (for a non-scalar array) we have 2 calls to
> >> PyMem_Malloc: 1 for the array object itself, and one for the shapes
> >> + strides. And, one call to regular-old malloc: for the data buffer.
> >>
> >> (Mysteriously, shapes + strides together have 2*ndim elements, but
> >> to hold them we allocate a memory region sized to hold 3*ndim
> >> elements. I'm not sure why.)
> >>
> >> And contrary to what I said earlier, this is about as optimized as
> >> it can be without breaking ABI. We need at least 2 calls to
> >> malloc/PyMem_Malloc, because the shapes+strides may need to be
> >> resized without affecting the much larger data area. But it's
> >> tempting to allocate the array object and the data buffer in a
> >> single memory region, like I suggested earlier. And this would
> >> ALMOST work. But, it turns out there is code out there which assumes
> >> (whether wisely or not) that you can swap around which data buffer a
> >> given PyArrayObject refers to (hi Theano!). And supporting this
> >> means that data buffers and PyArrayObjects need to be in separate
> >> memory regions.
> >
> >
> > Are you sure that Theano "swaps" the data ptr of an ndarray? When we
> > play with that, it is on a newly created ndarray, so a node in our
> > graph won't change the input ndarray structure. It will create a new
> > ndarray structure with new shape/strides and pass a data ptr, and we
> > flag the new ndarray with own_data correctly, to my knowledge.
> >
> > If Theano poses a problem here, I'll suggest that I fix Theano. But
> > currently I don't see the problem. So if this makes you change your
> > mind about this optimization, tell me. I don't want Theano to
> > prevent optimization in NumPy.
>
> It's entirely possible I misunderstood, so let's see if we can work it
> out. I know that you want to assign to the ->data pointer in a
> PyArrayObject, right? That's what caused some trouble with the 1.7 API
> deprecations, which were trying to prevent direct access to this
> field? Creating a new array given a pointer to a memory region is no
> problem, and obviously will be supported regardless of any
> optimizations. But if that's all you were doing then you shouldn't
> have run into the deprecation problem. Or maybe I'm misremembering!
>

What we currently do, in only one place, is create a new PyArrayObject
with a given data ptr, so NumPy doesn't do the allocation. We later
change that ptr to another one.

It is the change to the ptr of the just-created PyArrayObject that
caused problems with the interface deprecation. I fixed all the other
problems related to the deprecation (mostly just renaming
functions/macros), but I haven't fixed this one yet. I would need to
change the logic to compute the final ptr before creating the
PyArrayObject, and create it with the final data ptr. But in all cases,
NumPy didn't allocate the data memory for this object, so this case
doesn't block your optimization.
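
To make this concrete, here is a minimal sketch of the fix I have in
mind (untested; wrap_external_buffer and the offset argument are made
up for illustration, only the NumPy call is real):

    /* Compute the final data ptr first, then create the PyArrayObject
     * around it, instead of creating the array and assigning to
     * ->data afterward.  Assumes import_array() was called at module
     * init. */
    #include <Python.h>
    #include <numpy/arrayobject.h>

    static PyObject *
    wrap_external_buffer(char *base, npy_intp offset,
                         int nd, npy_intp *dims)
    {
        char *final_ptr = base + offset;  /* final ptr computed BEFORE
                                             creating the array */

        /* NumPy never allocates or owns the data here, so this stays
         * valid even if PyArrayObject and its buffer were one day
         * allocated in a single memory region. */
        return PyArray_SimpleNewFromData(nd, dims, NPY_FLOAT64,
                                         (void *)final_ptr);
    }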

One thing on our optimization "wish list" is to reuse allocated
PyArrayObjects between Theano function calls for intermediate results
(so completely under Theano's control). This could be useful in
particular for reshape/transpose/subtensor. Those functions are pretty
fast and, from memory, I already found the allocation time to be
significant. But in those cases, it is on PyArrayObjects that are
views, so the metadata and the data would be in different memory
regions in all cases.

The other case on our optimization "wish list" is reusing the
PyArrayObject when the shape isn't the right one (but the number of
dimensions is the same). If we do that for operations like addition, we
will need to use PyArray_Resize(). This will be done on PyArrayObjects
whose data memory was allocated by NumPy. So if you do one memory
allocation for metadata and data, just make sure that PyArray_Resize()
will handle that correctly.
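
The pattern I mean is roughly the sketch below (untested;
reuse_with_new_shape is a made-up name). It is this PyArray_Resize()
call that would have to keep working if metadata and data ever share
one allocation:

    #include <Python.h>
    #include <numpy/arrayobject.h>

    /* Reuse `arr` with a new shape; assumes NumPy allocated (and owns)
     * its data buffer, and that import_array() was called at module
     * init. */
    static int
    reuse_with_new_shape(PyArrayObject *arr, int nd, npy_intp *new_dims)
    {
        PyArray_Dims shape = {new_dims, nd};

        /* refcheck=1: refuse to resize if someone else holds a
         * reference to the buffer.  NPY_CORDER keeps C order. */
        PyObject *res = PyArray_Resize(arr, &shape, 1, NPY_CORDER);
        if (res == NULL) {
            return -1;      /* exception already set */
        }
        Py_DECREF(res);     /* PyArray_Resize returns None on success */
        return 0;
    }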

On the usefulness of doing only 1 memory allocation: on our old gpu
ndarray, we were doing 2 allocations on the GPU, one for metadata and
one for data. I removed this, as it was a bottleneck. Allocation on the
CPU is faster than on the GPU, but it is still slow unless you reuse
memory. Does PyMem_Malloc reuse previous small allocations?
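
For illustration, the single-allocation layout we used was roughly the
following (a toy sketch, not our actual GPU code; all names are made
up):

    #include <stdlib.h>

    /* One malloc holds the metadata struct followed by the data, so
     * creating an array costs one allocation instead of two. */
    typedef struct {
        size_t size;   /* bytes in the data segment */
        char  *data;   /* points just past the header, inside the
                          same allocation */
    } toy_ndarray;

    static toy_ndarray *
    toy_ndarray_new(size_t nbytes)
    {
        toy_ndarray *a = malloc(sizeof(toy_ndarray) + nbytes);
        if (a == NULL) {
            return NULL;
        }
        a->size = nbytes;
        /* pointer arithmetic, no second malloc */
        a->data = (char *)a + sizeof(toy_ndarray);
        return a;
    }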

For those who read all of this, the conclusion is that Theano shouldn't
block this optimization. If you optimize the allocation of new
PyArrayObjects, there will just be less incentive for us to do the
"wish list" optimizations.

One last thing to keep in mind is that you should keep the data segment
aligned. I would argue that alignment to the datatype size isn't
enough, so I would suggest aligning to the cache line size or something
like that, but I don't have numbers to back this up. This would also
help in the case of a resize that changes the number of dimensions.
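
Something like the rounding below is what I have in mind (a sketch
only; the 64-byte cache line is an assumption, not a measured number):

    #include <stdint.h>

    #define CACHE_LINE 64

    /* Round a pointer up to the next cache-line boundary.  With a
     * combined header+data allocation, the data segment would start at
     * align_up(base + header_size), over-allocating by CACHE_LINE - 1
     * bytes to leave room for the rounding. */
    static char *
    align_up(char *p)
    {
        uintptr_t u = (uintptr_t)p;
        return (char *)((u + CACHE_LINE - 1)
                        & ~(uintptr_t)(CACHE_LINE - 1));
    }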


Fred