Re: [Numpy-discussion] Speedup by avoiding memory alloc twice in scalar array

8 Jan 2014

On Wed, Jan 8, 2014 at 12:13 PM, Julian Taylor wrote:
...
On 18.07.2013 15:36, Nathaniel Smith wrote:
...
On Wed, Jul 17, 2013 at 5:57 PM, Frédéric Bastien wrote:
...
On the usefulness of doing only 1 memory allocation, on our old gpu ndarray,
we were doing 2 allocs on the GPU, one for metadata and one for data. I
removed this, as it was a bottleneck. Allocations on the CPU are faster than
on the GPU, but this is still something that is slow unless you reuse
memory. Does PyMem_Malloc reuse previous small allocations?
Yes, at least in theory PyMem_Malloc is highly-optimized for small
buffer re-use. (For requests >256 bytes it just calls malloc().) And
it's possible to define type-specific freelists; not sure if there's
any value in doing that for PyArrayObjects. See Objects/obmalloc.c in
the Python source tree.
PyMem_Malloc is just a wrapper around malloc, so it's only as optimized
as the C library is (glibc is not good for small allocations).
PyObject_Malloc uses a small object allocator for requests smaller than 512
bytes (256 in Python 2).
Right, I meant PyObject_Malloc of course.
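
To make the buffer-reuse idea above concrete, here is a minimal sketch of a
single-size freelist in plain C. It is only a toy: it is not how obmalloc or
PyObject_Malloc is implemented, and the names small_alloc, small_free,
SMALL_LIMIT and CACHE_SLOTS are hypothetical. The 512-byte limit simply
mirrors the threshold mentioned above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical names throughout; this is a toy, not obmalloc. */
#define SMALL_LIMIT 512   /* requests larger than this go straight to malloc */
#define CACHE_SLOTS 16    /* at most this many freed blocks are kept around */

static void *cache[CACHE_SLOTS];
static int cache_len = 0;

/* Allocate n bytes. Every small request is rounded up to SMALL_LIMIT so
 * that any cached block can satisfy any small request. */
void *small_alloc(size_t n)
{
    if (n > SMALL_LIMIT)
        return malloc(n);           /* large request: no caching */
    if (cache_len > 0)
        return cache[--cache_len];  /* reuse a previously freed block */
    return malloc(SMALL_LIMIT);
}

/* Release a block obtained from small_alloc(n). Small blocks are parked
 * in the cache instead of being returned to the C library. */
void small_free(void *p, size_t n)
{
    if (p == NULL)
        return;
    if (n <= SMALL_LIMIT && cache_len < CACHE_SLOTS) {
        cache[cache_len++] = p;
        assert(cache_len <= CACHE_SLOTS);
        return;
    }
    free(p);
}
```

With this sketch, freeing a 64-byte block and then requesting 128 bytes hands
back the same pointer without touching malloc. It also shows why allocation
and deallocation functions have to be paired: passing a block from
small_alloc to plain free() would bypass the cache bookkeeping.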
...
I filed a pull request [0] replacing a few functions which I think are
safe to convert to this API: the nditer allocation, which is completely
encapsulated, and the construction of the scalar and array Python objects,
which are deleted via the tp_free slot (we really should not support
third party libraries using PyMem_Free on Python objects without checks).
This already gives up to a 15% improvement for scalar operations compared
to glibc 2.17 malloc.
Do I understand the discussion here correctly, that we could replace
PyDimMem_NEW, which is used for strides in PyArray, with the small object
allocation too?
It would still allow swapping the stride buffer, but every application
must then delete it with PyDimMem_FREE which should be a reasonable
requirement.
That sounds reasonable to me.

If we wanted to get even more elaborate, we could by default stick the
shape/strides into the same allocation as the PyArrayObject, and then
defer allocating a separate buffer until someone actually calls
PyArray_Resize. (With a new flag, similar to OWNDATA, that tells us
whether we need to free the shape/stride buffer when deallocating the
array.) It's got to be a vanishingly small proportion of arrays where
PyArray_Resize is actually called, so for most arrays, this would let
us skip the allocation entirely, and the only cost would be that for
arrays where PyArray_Resize *is* called to add new dimensions, we'd
leave the original buffers sitting around until the array was freed,
wasting a tiny amount of memory. Given that no-one has noticed that
currently *every* array wastes 50% of this much memory (see upthread),
I doubt anyone will care...
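
A rough sketch of that layout, in plain C rather than against the real
PyArrayObject: the struct ToyArray, the flag TOY_OWNDIMS and the functions
toy_new, toy_set_dims and toy_free are all hypothetical names made up for
illustration. The shape and strides live in a flexible array member at the
end of the object, so creating an array takes one allocation, and a separate
buffer is only allocated if the dimensionality later grows.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical flag, similar in spirit to OWNDATA: set when dims points
 * to a separately allocated buffer that must be freed with the array. */
#define TOY_OWNDIMS 0x1

typedef struct {
    int nd;                   /* current number of dimensions */
    int dims_cap;             /* dimensions the current dims buffer can hold */
    int flags;
    ptrdiff_t *dims;          /* nd shape entries followed by nd strides */
    ptrdiff_t inline_dims[];  /* storage inside the object allocation */
} ToyArray;

/* One allocation covers the header plus shape and strides. */
ToyArray *toy_new(int nd, const ptrdiff_t *shape, const ptrdiff_t *strides)
{
    ToyArray *a = malloc(sizeof(ToyArray) + 2 * (size_t)nd * sizeof(ptrdiff_t));
    if (a == NULL)
        return NULL;
    a->nd = nd;
    a->dims_cap = nd;
    a->flags = 0;
    a->dims = a->inline_dims;
    if (nd > 0) {
        memcpy(a->dims, shape, (size_t)nd * sizeof(ptrdiff_t));
        memcpy(a->dims + nd, strides, (size_t)nd * sizeof(ptrdiff_t));
    }
    return a;
}

/* Resize-like operation: a separate buffer is allocated only when the new
 * dimensionality no longer fits. The inline storage then stays unused
 * until the array is freed, which is the small waste described above. */
int toy_set_dims(ToyArray *a, int nd, const ptrdiff_t *shape,
                 const ptrdiff_t *strides)
{
    if (nd > a->dims_cap) {
        ptrdiff_t *buf = malloc(2 * (size_t)nd * sizeof(ptrdiff_t));
        if (buf == NULL)
            return -1;
        if (a->flags & TOY_OWNDIMS)
            free(a->dims);
        a->dims = buf;
        a->dims_cap = nd;
        a->flags |= TOY_OWNDIMS;
    }
    a->nd = nd;
    if (nd > 0) {
        memcpy(a->dims, shape, (size_t)nd * sizeof(ptrdiff_t));
        memcpy(a->dims + nd, strides, (size_t)nd * sizeof(ptrdiff_t));
    }
    return 0;
}

void toy_free(ToyArray *a)
{
    if (a == NULL)
        return;
    if (a->flags & TOY_OWNDIMS)
        free(a->dims);   /* only the separately allocated buffer */
    free(a);
}
```

In the common case toy_set_dims is never called, the flag stays clear, and
toy_free releases everything with a single free().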

-- 
Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org