[Numpy-discussion] Speedup by avoiding memory alloc twice in scalar array

Julian Taylor jtaylor.debian at googlemail.com
Wed Jan 8 13:13:20 EST 2014


On 18.07.2013 15:36, Nathaniel Smith wrote:
> On Wed, Jul 17, 2013 at 5:57 PM, Frédéric Bastien <nouiz at nouiz.org> wrote:
>> On Wed, Jul 17, 2013 at 10:39 AM, Nathaniel Smith <njs at pobox.com> wrote:
>>>>
>>>> On Tue, Jul 16, 2013 at 11:55 AM, Nathaniel Smith <njs at pobox.com> wrote:
>>> It's entirely possible I misunderstood, so let's see if we can work it
>>> out. I know that you want to assign to the ->data pointer in a
>>> PyArrayObject, right? That's what caused some trouble with the 1.7 API
>>> deprecations, which were trying to prevent direct access to this
>>> field? Creating a new array given a pointer to a memory region is no
>>> problem, and obviously will be supported regardless of any
>>> optimizations. But if that's all you were doing then you shouldn't
>>> have run into the deprecation problem. Or maybe I'm misremembering!
>>
>> Currently there is only one place where we create a new PyArrayObject with
>> a given ptr, so NumPy doesn't do the allocation. We later change that ptr
>> to another one.
> 
> Hmm, OK, so that would still work. If the array has the OWNDATA flag
> set (or you otherwise know where the data came from), then swapping
> the data pointer would still work.
> 
> The change would be that in most cases when asking numpy to allocate a
> new array from scratch, the OWNDATA flag would not be set. That's
> because the OWNDATA flag really means "when this object is
> deallocated, call free(self->data)", but if we allocate the array
> struct and the data buffer together in a single memory region, then
> deallocating the object will automatically cause the data buffer to be
> deallocated as well, without the array destructor having to take any
> special effort.
> 
>> It is the change to the ptr of the just-created PyArrayObject that caused
>> problems with the interface deprecation. I fixed all the other problems
>> related to the deprecation (mostly just renames of functions/macros), but
>> I haven't fixed this one yet. I would need to change the logic to compute
>> the final ptr before creating the PyArrayObject and create it with the
>> final data ptr. But in all cases, NumPy didn't allocate the data memory
>> for this object, so this case doesn't block your optimization.
> 
> Right.
> 
>> One thing on our optimization wish list is to reuse allocated
>> PyArrayObjects between Theano function calls for intermediate results (so
>> completely under Theano's control). This could be useful in particular for
>> reshape/transpose/subtensor. Those functions are pretty fast, and from
>> memory, I found the allocation time was already significant. But in those
>> cases the PyArrayObjects are views, so the metadata and the data would be
>> in different memory regions in all cases.
>>
>> The other case on our optimization wish list is reusing the PyArrayObject
>> when the shape isn't the right one (but the number of dimensions is the
>> same). If we do that for operations like addition, we will need to use
>> PyArray_Resize(). This will be done on PyArrayObjects whose data memory
>> was allocated by NumPy. So if you do one memory allocation for metadata
>> and data, just make sure that PyArray_Resize() will handle that correctly.
> 
> I'm not sure I follow the details here, but it does turn out that a
> really surprising amount of time in PyArray_NewFromDescr is spent in
> just calculating and writing out the shape and strides buffers, so for
> programs that e.g. use hundreds of small 3-element arrays to represent
> points in space, re-using even these buffers might be a big win...
> 
>> On the usefulness of doing only one memory allocation: in our old gpu
>> ndarray, we were doing two allocations on the GPU, one for metadata and
>> one for data. I removed this, as it was a bottleneck. Allocations on the
>> CPU are faster than on the GPU, but they are still slow unless you reuse
>> memory. Does PyMem_Malloc reuse previous small allocations?
> 
> Yes, at least in theory PyMem_Malloc is highly-optimized for small
> buffer re-use. (For requests >256 bytes it just calls malloc().) And
> it's possible to define type-specific freelists; not sure if there's
> any value in doing that for PyArrayObjects. See Objects/obmalloc.c in
> the Python source tree.
> 
> -n
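The single-allocation scheme Nathaniel describes above (array struct and data buffer in one memory region, so freeing the object frees the data with no OWNDATA-style cleanup) can be sketched in plain C with a flexible array member. This is a hypothetical miniature, not NumPy's actual struct layout:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical miniature of the proposal: allocate the array "object"
 * header and its data buffer in one malloc().  One free() then releases
 * both, so the destructor never needs a separate free(self->data) step
 * the way an OWNDATA array does. */
typedef struct {
    size_t nbytes;   /* size of the trailing data buffer */
    int owndata;     /* 0: data lives inside this same allocation */
    double data[];   /* flexible array member: data follows the header */
} miniarray;

static miniarray *miniarray_new(size_t nelem)
{
    miniarray *a = malloc(sizeof(*a) + nelem * sizeof(double));
    if (a == NULL)
        return NULL;
    a->nbytes = nelem * sizeof(double);
    a->owndata = 0;              /* nothing separate to free */
    memset(a->data, 0, a->nbytes);
    return a;
}

static void miniarray_free(miniarray *a)
{
    /* destructor: a single free releases header and data together */
    free(a);
}
```

This also illustrates the caveat from the quoted discussion: with such arrays the data pointer cannot be swapped out and freed independently, which is why OWNDATA would no longer be set on them.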

PyMem_Malloc is just a wrapper around malloc, so it is only as optimized
as the C library's malloc (glibc is not good for small allocations).
PyObject_Malloc uses a small-object allocator for requests smaller than
512 bytes (256 in Python 2).
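The reason a small-object allocator wins here can be shown with a minimal freelist sketch: recycled blocks of one size class are handed out with a pointer pop instead of a trip into the general-purpose heap. This mimics the idea behind PyObject_Malloc's pools only; it is not CPython's implementation:

```c
#include <stddef.h>
#include <stdlib.h>

/* Toy single-size-class freelist allocator.  Freed blocks are pushed
 * onto a list and reused on the next allocation, so the common
 * alloc/free/alloc pattern never touches malloc after warm-up. */
#define BLOCK_SIZE 64            /* one size class, e.g. small scalars */

typedef union block {
    union block *next;           /* valid only while on the freelist */
    char payload[BLOCK_SIZE];
} block;

static block *freelist = NULL;

static void *small_alloc(void)
{
    if (freelist != NULL) {      /* fast path: pop a recycled block */
        block *b = freelist;
        freelist = b->next;
        return b;
    }
    return malloc(sizeof(block)); /* slow path: go to the heap */
}

static void small_free(void *p)
{
    block *b = p;                /* push back for later reuse */
    b->next = freelist;
    freelist = b;
}
```

A real implementation (like pymalloc) adds per-size-class pools and arena management, but the fast path is essentially this pointer swap, which is why it beats glibc malloc for the many short-lived small allocations scalar operations produce.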

I filed a pull request [0] replacing allocations in a few places which I
think are safe to convert to this API: the nditer allocation, which is
completely encapsulated, and the construction of the scalar and array
Python objects, which are deleted via the tp_free slot (we really should
not support third-party libraries calling PyMem_Free on Python objects
without checks).

This already gives up to 15% improvement for scalar operations compared
to glibc 2.17 malloc.
Do I understand the discussion here correctly that we could also replace
PyDimMem_NEW, which is used for the strides in PyArray, with the
small-object allocator?
It would still allow swapping the stride buffer, but every application
must then delete it with PyDimMem_FREE, which should be a reasonable
requirement.

[0] https://github.com/numpy/numpy/pull/4177
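The matched-deallocator requirement above can be sketched in plain C. The names `dim_new`/`dim_free`/`swap_strides` are illustrative stand-ins, not the actual PyDimMem_NEW/PyDimMem_FREE macros: whoever swaps in a replacement strides buffer gets the old one back and must release it through the matching free function, never plain free():

```c
#include <stdlib.h>

/* Illustrative stand-ins for an allocator pair like
 * PyDimMem_NEW/PyDimMem_FREE.  If these were backed by a small-object
 * allocator, releasing a buffer with plain free() would corrupt it,
 * hence the "must use the matching FREE" requirement. */
typedef long dim_t;

static dim_t *dim_new(size_t n)
{
    return malloc(n * sizeof(dim_t));
}

static void dim_free(dim_t *p)
{
    free(p);
}

/* Swap an array's strides buffer; the caller receives the old buffer
 * and is responsible for passing it to dim_free(). */
static dim_t *swap_strides(dim_t **strides, dim_t *replacement)
{
    dim_t *old = *strides;
    *strides = replacement;
    return old;
}
```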



