[Numpy-discussion] Speedup by avoiding memory alloc twice in scalar array
njs at pobox.com
Wed Jan 8 15:40:26 EST 2014
On Wed, Jan 8, 2014 at 12:13 PM, Julian Taylor
<jtaylor.debian at googlemail.com> wrote:
> On 18.07.2013 15:36, Nathaniel Smith wrote:
>> On Wed, Jul 17, 2013 at 5:57 PM, Frédéric Bastien <nouiz at nouiz.org> wrote:
>>> On the usefulness of doing only 1 memory allocation: on our old GPU ndarray,
>>> we were doing 2 allocations on the GPU, one for metadata and one for data. I
>>> removed this, as it was a bottleneck. Allocations on the CPU are faster than
>>> on the GPU, but they are still slow unless you reuse
>>> memory. Does PyMem_Malloc reuse previous small allocations?
>> Yes, at least in theory PyMem_Malloc is highly-optimized for small
>> buffer re-use. (For requests >256 bytes it just calls malloc().) And
>> it's possible to define type-specific freelists; not sure if there's
>> any value in doing that for PyArrayObjects. See Objects/obmalloc.c in
>> the Python source tree.
> PyMem_Malloc is just a wrapper around malloc, so it's only as optimized
> as the C library is (glibc is not good for small allocations).
> PyObject_Malloc uses a small object allocator for requests smaller than 512
> bytes (256 in Python 2).
Right, I meant PyObject_Malloc of course.
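To make the difference concrete, here is a toy sketch in plain C of the idea behind a small-object allocator: freed blocks are kept on per-size-class free lists and handed back on the next request of a similar size, so malloc() is only hit on a miss. This is not CPython's implementation (Objects/obmalloc.c works with arenas and pools and recovers the block size itself); the names small_alloc, small_free, SMALL_LIMIT and ALIGNMENT are made up for illustration, and the 8-byte size classes are an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define SMALL_LIMIT 512                      /* threshold quoted in the thread */
#define ALIGNMENT   8                        /* size-class granularity (assumption) */
#define NUM_CLASSES (SMALL_LIMIT / ALIGNMENT)

/* A freed block stores the link to the next free block in its own memory. */
typedef struct free_block {
    struct free_block *next;
} free_block;

/* One singly linked free list per size class; all start out empty. */
static free_block *free_lists[NUM_CLASSES];

/* Map a request of 1..SMALL_LIMIT bytes to its size-class index. */
static size_t size_class(size_t nbytes)
{
    return (nbytes - 1) / ALIGNMENT;
}

void *small_alloc(size_t nbytes)
{
    if (nbytes == 0 || nbytes > SMALL_LIMIT) {
        /* Large (or empty) request: go straight to the C library. */
        return malloc(nbytes ? nbytes : 1);
    }
    size_t idx = size_class(nbytes);
    free_block *blk = free_lists[idx];
    if (blk != NULL) {
        free_lists[idx] = blk->next;         /* reuse a previously freed block */
        return blk;
    }
    return malloc((idx + 1) * ALIGNMENT);    /* miss: allocate a full class-sized block */
}

/* Unlike a real allocator, this toy needs the caller to pass the size back. */
void small_free(void *ptr, size_t nbytes)
{
    if (ptr == NULL) {
        return;
    }
    if (nbytes == 0 || nbytes > SMALL_LIMIT) {
        free(ptr);
        return;
    }
    assert(sizeof(free_block) <= ALIGNMENT); /* the link must fit in the smallest block */
    free_block *blk = ptr;
    size_t idx = size_class(nbytes);
    blk->next = free_lists[idx];             /* push onto the class's free list */
    free_lists[idx] = blk;
}
```

With this sketch, freeing a 40-byte block and then asking for 33 bytes returns the same pointer without touching malloc(), which is the kind of reuse that pays off when many short-lived scalar objects are created and destroyed.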
> I filed a pull request replacing a few functions which I think are
> safe to convert to this API: the nditer allocation, which is completely
> encapsulated, and the construction of the scalar and array Python objects,
> which are deleted via the tp_free slot (we really should not support
> third-party libraries using PyMem_Free on Python objects without checks).
> This already gives up to a 15% improvement for scalar operations compared
> to glibc 2.17 malloc.
> Do I understand the discussions here right that we could replace
> PyDimMem_NEW which is used for strides in PyArray with the small object
> allocation too?
> It would still allow swapping the stride buffer, but every application
> must then delete it with PyDimMem_FREE, which should be a reasonable
> requirement.
That sounds reasonable to me.
If we wanted to get even more elaborate, we could by default stick the
shape/strides into the same allocation as the PyArrayObject, and then
defer allocating a separate buffer until someone actually calls
PyArray_Resize. (With a new flag, similar to OWNDATA, that tells us
whether we need to free the shape/stride buffer when deallocating the
array.) It's got to be a vanishingly small proportion of arrays where
PyArray_Resize is actually called, so for most arrays, this would let
us skip the allocation entirely, and the only cost would be that for
arrays where PyArray_Resize *is* called to add new dimensions, we'd
leave the original buffers sitting around until the array was freed,
wasting a tiny amount of memory. Given that no-one has noticed that
currently *every* array wastes 50% of this much memory (see upthread),
I doubt anyone will care...
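A rough sketch of what that layout could look like, in plain C rather than against the real NumPy headers. Everything here is hypothetical: toy_array, TOY_OWNDIMS, toy_array_new, toy_array_redim and toy_array_free are illustration-only names, ptrdiff_t stands in for npy_intp, and the real PyArrayObject has many more fields. The shape and strides live in a flexible array member at the end of the object, so creating an array needs one allocation, and a separate buffer is only allocated if a later resize adds dimensions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define TOY_OWNDIMS 0x1   /* hypothetical flag: dims points at a separately malloc'ed buffer */

typedef struct {
    int flags;
    int nd;                    /* current number of dimensions */
    int capacity;              /* dimensions the buffer behind dims can hold */
    ptrdiff_t *dims;           /* nd shape entries followed by nd strides */
    ptrdiff_t inline_dims[];   /* storage allocated together with the object */
} toy_array;

/* Copy shape and strides into a->dims as [shape..., strides...]. */
static void toy_store(toy_array *a, int nd,
                      const ptrdiff_t *shape, const ptrdiff_t *strides)
{
    a->nd = nd;
    if (nd > 0) {
        memcpy(a->dims, shape, (size_t)nd * sizeof *shape);
        memcpy(a->dims + nd, strides, (size_t)nd * sizeof *strides);
    }
}

/* One allocation covers the object and its shape/strides. */
toy_array *toy_array_new(int nd, const ptrdiff_t *shape, const ptrdiff_t *strides)
{
    assert(nd >= 0);
    toy_array *a = malloc(sizeof *a + 2 * (size_t)nd * sizeof(ptrdiff_t));
    if (a == NULL) {
        return NULL;
    }
    a->flags = 0;
    a->capacity = nd;
    a->dims = a->inline_dims;
    toy_store(a, nd, shape, strides);
    return a;
}

/* Only when dimensions are added do we pay for a separate buffer. */
int toy_array_redim(toy_array *a, int nd,
                    const ptrdiff_t *shape, const ptrdiff_t *strides)
{
    assert(nd >= 0);
    if (nd > a->capacity) {
        ptrdiff_t *buf = malloc(2 * (size_t)nd * sizeof *buf);
        if (buf == NULL) {
            return -1;
        }
        if (a->flags & TOY_OWNDIMS) {
            free(a->dims);     /* an earlier separate buffer is replaced */
        }
        /* The inline storage stays unused until the object itself is freed. */
        a->dims = buf;
        a->capacity = nd;
        a->flags |= TOY_OWNDIMS;
    }
    toy_store(a, nd, shape, strides);
    return 0;
}

void toy_array_free(toy_array *a)
{
    if (a == NULL) {
        return;
    }
    if (a->flags & TOY_OWNDIMS) {
        free(a->dims);         /* the flag tells us the buffer is ours to free */
    }
    free(a);
}
```

In this sketch an array that is never resized is created and destroyed with a single malloc()/free() pair, and the only waste is the unused inline slots on the few arrays that do grow.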
Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh