Hi,

As mentioned, I don't think Theano swaps the stride buffer. Most of the time we allocate with PyArray_empty or zeros (not sure of the capitalization). The only exception I remember was changed in the last release to use PyArray_NewFromDescr(). Before that, we were allocating the PyArrayObject with the right number of dimensions and then manually filling in the ptr, shapes and strides. I don't recall any swapping of the shapes or strides pointers in Theano, so I don't see why Theano would prevent doing just one malloc for the struct and the shapes/strides. If it does, tell me and I'll fix Theano :) I don't want Theano to prevent optimization in NumPy. Theano now fully supports the new NumPy C-API interface.

Nathaniel also mentioned that resizing the PyArray could prevent that. When Theano calls PyArray_resize (not sure of the syntax), we always keep the number of dimensions the same, but I don't know if other code does differently. That could be a reason to keep separate allocations.

I don't know of any software that manually frees the strides/shapes pointer in order to swap it, so I also think your suggestion to change PyDimMem_NEW to call the small allocator is good. The new interface prevents people from doing that anyway, I think. Do we need to wait until we completely remove the old interface for this?

Fred

On Wed, Jan 8, 2014 at 1:13 PM, Julian Taylor <jtaylor.debian@googlemail.com> wrote:
On 18.07.2013 15:36, Nathaniel Smith wrote:
On Wed, Jul 17, 2013 at 5:57 PM, Frédéric Bastien <nouiz@nouiz.org> wrote:
On Wed, Jul 17, 2013 at 10:39 AM, Nathaniel Smith <njs@pobox.com> wrote:
On Tue, Jul 16, 2013 at 11:55 AM, Nathaniel Smith <njs@pobox.com> wrote:
It's entirely possible I misunderstood, so let's see if we can work it out. I know that you want to assign to the ->data pointer in a PyArrayObject, right? That's what caused some trouble with the 1.7 API deprecations, which were trying to prevent direct access to this field? Creating a new array given a pointer to a memory region is no problem, and obviously will be supported regardless of any optimizations. But if that's all you were doing then you shouldn't have run into the deprecation problem. Or maybe I'm misremembering!
What is currently done in only one place is to create a new PyArrayObject with a given ptr, so NumPy doesn't do the allocation. We later change that ptr to another one.
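Roughly, that pattern looks like this (a minimal sketch, not Theano's actual code; the function name and dtype are made up for illustration):

    #include <numpy/arrayobject.h>

    /* Wrap an existing, caller-owned float32 buffer in a new
     * PyArrayObject.  NumPy does not allocate or own the data here, so
     * OWNDATA stays unset and the caller keeps freeing the buffer. */
    static PyObject *
    wrap_existing_buffer(void *data, npy_intp n)
    {
        npy_intp dims[1] = {n};
        return PyArray_SimpleNewFromData(1, dims, NPY_FLOAT32, data);
    }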
Hmm, OK, so that would still work. If the array has the OWNDATA flag set (or you otherwise know where the data came from), then swapping the data pointer would still work.
The change would be that in most cases when asking numpy to allocate a new array from scratch, the OWNDATA flag would not be set. That's because the OWNDATA flag really means "when this object is deallocated, call free(self->data)", but if we allocate the array struct and the data buffer together in a single memory region, then deallocating the object will automatically cause the data buffer to be deallocated as well, without the array destructor having to take any special effort.
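As a rough sketch of the data-pointer swap being discussed (the helper name is made up, and the direct field write is exactly what the 1.7 deprecations flag):

    /* Swap the data pointer of an existing array.  If the array owns
     * its data, free the old buffer first; the new buffer is managed by
     * the caller, so clear OWNDATA afterwards. */
    static void
    swap_data_pointer(PyArrayObject *arr, char *new_data)
    {
        if (PyArray_CHKFLAGS(arr, NPY_ARRAY_OWNDATA)) {
            PyDataMem_FREE(PyArray_DATA(arr));
        }
        ((PyArrayObject_fields *)arr)->data = new_data;
        PyArray_CLEARFLAGS(arr, NPY_ARRAY_OWNDATA);
    }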
It is the change to the ptr of the just-created PyArrayObject that caused problems with the interface deprecation. I fixed all the other problems related to the deprecation (mostly just renames of functions/macros), but I haven't fixed this one yet. I would need to change the logic to compute the final ptr before creating the PyArrayObject and create it with the final data ptr. But in all cases, NumPy didn't allocate the data memory for this object, so this case doesn't block your optimization.
Right.
One thing on our optimization "wish list" is to reuse allocated PyArrayObjects between Theano function calls for intermediate results (so completely under Theano's control). This could be useful in particular for reshape/transpose/subtensor. Those operations are pretty fast, and from memory I already found that the allocation time was significant. But in those cases it is on PyArrayObjects that are views, so the metadata and the data would be in different memory regions in all cases.
The other case on the optimization "wish list" is reusing the PyArrayObject when the shape isn't the right one (but the number of dimensions is the same). If we do that for operations like addition, we will need to use PyArray_Resize(). This will be done on PyArrayObjects whose data memory was allocated by NumPy. So if you do one memory allocation for metadata and data, just make sure that PyArray_Resize() will handle that correctly.
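For concreteness, a resize that keeps the number of dimensions fixed might look roughly like this (sketch only; the helper name is made up):

    /* Resize a NumPy-allocated 2-d array in place to a new 2-d shape.
     * PyArray_Resize may realloc the data buffer, which is the case a
     * combined struct+data allocation would have to handle. */
    static int
    resize_keep_ndim(PyArrayObject *arr, npy_intp rows, npy_intp cols)
    {
        npy_intp shape[2] = {rows, cols};
        PyArray_Dims newshape = {shape, 2};
        PyObject *ret = PyArray_Resize(arr, &newshape, 1, NPY_CORDER);
        if (ret == NULL) {
            return -1;
        }
        Py_DECREF(ret);  /* PyArray_Resize returns None on success */
        return 0;
    }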
I'm not sure I follow the details here, but it does turn out that a really surprising amount of time in PyArray_NewFromDescr is spent in just calculating and writing out the shape and strides buffers, so for programs that e.g. use hundreds of small 3-element arrays to represent points in space, re-using even these buffers might be a big win...
On the usefulness of doing only one memory allocation: in our old gpu ndarray, we were doing two allocations on the GPU, one for metadata and one for data. I removed this, as it was a bottleneck. Allocations on the CPU are faster than on the GPU, but this is still something that is slow unless you reuse memory. Does PyMem_Malloc reuse previous small allocations?
Yes, at least in theory PyMem_Malloc is highly-optimized for small buffer re-use. (For requests >256 bytes it just calls malloc().) And it's possible to define type-specific freelists; not sure if there's any value in doing that for PyArrayObjects. See Objects/obmalloc.c in the Python source tree.
-n
PyMem_Malloc is just a wrapper around malloc, so it's only as optimized as the C library is (glibc is not good for small allocations). PyObject_Malloc uses a small-object allocator for requests smaller than 512 bytes (256 in Python 2).
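In other words (a tiny illustration of the difference described above; the sizes are arbitrary):

    #include <Python.h>

    static void
    allocator_demo(void)
    {
        void *a = PyMem_Malloc(64);     /* forwards to the C library malloc */
        void *b = PyObject_Malloc(64);  /* below the small-object threshold:
                                           served from pymalloc's pools */
        PyMem_Free(a);
        PyObject_Free(b);
    }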
I filed a pull request [0] replacing a few functions which I think are safe to convert to this API: the nditer allocation, which is completely encapsulated, and the construction of the scalar and array Python objects, which are deleted via the tp_free slot (we really should not support third-party libraries calling PyMem_Free on Python objects without checks).
This already gives up to 15% improvement for scalar operations compared to glibc 2.17 malloc. Do I understand the discussion here correctly that we could replace PyDimMem_NEW, which is used for the strides in PyArrayObject, with the small-object allocator too? It would still allow swapping the stride buffer, but every application must then delete it with PyDimMem_FREE, which should be a reasonable requirement.
[0] https://github.com/numpy/numpy/pull/4177
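A sketch of the pairing requirement that would imply for code building its own shape/strides buffers (illustrative only; the helper name is made up):

    /* Allocate a strides buffer with the NumPy dimension allocator and
     * always release it with the matching PyDimMem_FREE.  If
     * PyDimMem_NEW were backed by the small-object allocator, releasing
     * such a buffer with plain free() would no longer be valid. */
    static npy_intp *
    make_strides(int nd)
    {
        int i;
        npy_intp *strides = PyDimMem_NEW(nd);
        if (strides == NULL) {
            PyErr_NoMemory();
            return NULL;
        }
        for (i = 0; i < nd; i++) {
            strides[i] = 0;  /* fill in the real strides here */
        }
        return strides;
    }
    /* ... later, always: PyDimMem_FREE(strides); */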