[Numpy-discussion] Speedup by avoiding memory alloc twice in scalar array

Frédéric Bastien nouiz at nouiz.org
Wed Jan 8 14:04:38 EST 2014


As mentioned, I don't think Theano swaps the strides buffer. Most of the
time, we allocate with PyArray_Empty or PyArray_Zeros (not sure of the
capitals). The only exception I remember was changed in the last
release to use PyArray_NewFromDescr(). Before that, we were
allocating the PyArrayObject with the right number of dimensions, then we
were manually filling in the data ptr, shapes, and strides. I don't recall
any swapping of the shapes and strides pointers in Theano.
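For readers following along in Python, the ownership distinction being discussed can be sketched at the Python level (a minimal illustration only; np.empty and np.frombuffer are the Python-level counterparts of allocating via the C API versus wrapping a caller-supplied pointer):

```python
import numpy as np

# np.empty corresponds to NumPy allocating the data buffer itself
# (PyArray_Empty at the C level): the OWNDATA flag is set, so NumPy
# will free the buffer when the array is deallocated.
owned = np.empty((3, 4))

# np.frombuffer wraps an existing buffer, much like creating an array
# with PyArray_NewFromDescr and a caller-supplied data pointer:
# OWNDATA is not set, so NumPy will not free the data on deallocation.
buf = bytearray(8 * 12)  # room for 12 float64 values
wrapped = np.frombuffer(buf, dtype=np.float64).reshape(3, 4)

print(owned.flags['OWNDATA'])    # True
print(wrapped.flags['OWNDATA'])  # False
```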

So I don't see why Theano would prevent doing just one malloc for the
struct and the shapes/strides. If it does, tell me and I'll fix
Theano :) I don't want Theano to prevent optimizations in NumPy. Theano
now fully supports the new NumPy C-API interface.

Nathaniel also said that resizing the PyArray could prevent that. When
Theano calls PyArray_Resize (not sure of the exact name), we always keep
the number of dimensions the same. But I don't know whether other code does
differently. That could be a reason to keep the allocations separate.
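The resize pattern described here can be sketched at the Python level (ndarray.resize goes through PyArray_Resize in the C API; refcheck=False skips the reference-count safety check):

```python
import numpy as np

a = np.zeros((2, 3))

# In-place resize that keeps the number of dimensions the same, as
# Theano does. The data buffer may be reallocated, but the array
# object and its ndim are unchanged.
a.resize((4, 3), refcheck=False)

print(a.shape)  # (4, 3)
print(a.ndim)   # still 2
```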

I don't know of any software that manually frees the strides/shapes
pointer in order to swap it. So I also think your suggestion to change
PyDimMem_NEW to call the small-object allocator is good. The new interface
prevents people from doing that anyway, I think. Do we need to wait
until we completely remove the old interface for this?


On Wed, Jan 8, 2014 at 1:13 PM, Julian Taylor
<jtaylor.debian at googlemail.com> wrote:
> On 18.07.2013 15:36, Nathaniel Smith wrote:
>> On Wed, Jul 17, 2013 at 5:57 PM, Frédéric Bastien <nouiz at nouiz.org> wrote:
>>> On Wed, Jul 17, 2013 at 10:39 AM, Nathaniel Smith <njs at pobox.com> wrote:
>>>>> On Tue, Jul 16, 2013 at 11:55 AM, Nathaniel Smith <njs at pobox.com> wrote:
>>>> It's entirely possible I misunderstood, so let's see if we can work it
>>>> out. I know that you want to assign to the ->data pointer in a
>>>> PyArrayObject, right? That's what caused some trouble with the 1.7 API
>>>> deprecations, which were trying to prevent direct access to this
>>>> field? Creating a new array given a pointer to a memory region is no
>>>> problem, and obviously will be supported regardless of any
>>>> optimizations. But if that's all you were doing then you shouldn't
>>>> have run into the deprecation problem. Or maybe I'm misremembering!
>>> What is currently done in only one place is to create a new PyArrayObject
>>> with a given ptr, so NumPy doesn't do the allocation. We later change that
>>> ptr to another one.
>> Hmm, OK, so that would still work. If the array has the OWNDATA flag
>> set (or you otherwise know where the data came from), then swapping
>> the data pointer would still work.
>> The change would be that in most cases when asking numpy to allocate a
>> new array from scratch, the OWNDATA flag would not be set. That's
>> because the OWNDATA flag really means "when this object is
>> deallocated, call free(self->data)", but if we allocate the array
>> struct and the data buffer together in a single memory region, then
>> deallocating the object will automatically cause the data buffer to be
>> deallocated as well, without the array destructor having to take any
>> special effort.
>>> It is the change to the ptr of the just-created PyArrayObject that caused
>>> problems with the interface deprecation. I fixed all the other problems related
>>> to the deprecation (mostly just renames of functions/macros), but I haven't
>>> fixed this one yet. I would need to change the logic to compute the final
>>> ptr before creating the PyArrayObject and create it with the final
>>> data ptr. But in all cases, NumPy didn't allocate the data memory for this
>>> object, so this case doesn't block your optimization.
>> Right.
>>> One thing on our optimization "wish list" is to reuse allocated
>>> PyArrayObjects between Theano function calls for intermediate results (so
>>> completely under Theano's control). This could be useful in particular for
>>> reshape/transpose/subtensor. Those functions are pretty fast and, from
>>> memory, I had already found the allocation time to be significant. But in those
>>> cases it is PyArrayObjects that are views, so the metadata and the data
>>> would be in different memory regions in all cases.
>>> The other case on our optimization "wish list" is reusing the
>>> PyArrayObject when the shape isn't the right one (but the number of
>>> dimensions is the same). If we do that for operations like addition, we will
>>> need to use PyArray_Resize(). This will be done on PyArrayObjects whose data
>>> memory was allocated by NumPy. So if you do one memory allocation for
>>> metadata and data, just make sure that PyArray_Resize() will handle that
>>> correctly.
>> I'm not sure I follow the details here, but it does turn out that a
>> really surprising amount of time in PyArray_NewFromDescr is spent in
>> just calculating and writing out the shape and strides buffers, so for
>> programs that e.g. use hundreds of small 3-element arrays to represent
>> points in space, re-using even these buffers might be a big win...
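[Editor's note: a rough Python-level illustration of the reuse win mentioned above is writing into a preallocated output buffer instead of allocating a fresh small array per operation; a sketch only, actual timings vary:]

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

# Fresh allocation per call: a + b builds a new 3-element array each
# time, paying the object/metadata setup cost discussed above.
fresh = a + b

# Reuse: np.add writes into a preallocated buffer, skipping the
# per-call array (and shape/strides) allocation entirely.
out = np.empty(3)
np.add(a, b, out=out)

print(np.array_equal(fresh, out))  # True
```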
>>> On the usefulness of doing only one memory allocation: in our old GPU
>>> ndarray, we were doing two allocations on the GPU, one for metadata and one
>>> for data. I removed this, as it was a bottleneck. Allocation on the CPU is
>>> faster than on the GPU, but it is still something that is slow unless you
>>> reuse memory. Does PyMem_Malloc reuse previous small allocations?
>> Yes, at least in theory PyMem_Malloc is highly-optimized for small
>> buffer re-use. (For requests >256 bytes it just calls malloc().) And
>> it's possible to define type-specific freelists; not sure if there's
>> any value in doing that for PyArrayObjects. See Objects/obmalloc.c in
>> the Python source tree.
>> -n
> PyMem_Malloc is just a wrapper around malloc, so it's only as optimized
> as the C library is (glibc is not good for small allocations).
> PyObject_Malloc uses a small-object allocator for requests smaller than
> 512 bytes (256 in Python 2).
> I filed a pull request [0] replacing a few functions which I think are
> safe to convert to this API: the nditer allocation, which is completely
> encapsulated, and the construction of the scalar and array Python objects,
> which are deleted via the tp_free slot (we really should not support
> third-party libraries calling PyMem_Free on Python objects without checks).
> This already gives up to 15% improvement for scalar operations compared
> to glibc 2.17 malloc.
> Do I understand the discussion here right that we could replace
> PyDimMem_NEW, which is used for the strides in PyArrayObject, with the
> small-object allocator too?
> It would still allow swapping the strides buffer, but every application
> must then delete it with PyDimMem_FREE, which should be a reasonable
> requirement.
> [0] https://github.com/numpy/numpy/pull/4177
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
