On Wed, Jan 8, 2014 at 12:13 PM, Julian Taylor <jtaylor.debian@googlemail.com> wrote:
On 18.07.2013 15:36, Nathaniel Smith wrote:
On Wed, Jul 17, 2013 at 5:57 PM, Frédéric Bastien <nouiz@nouiz.org> wrote:
On the usefulness of doing only 1 memory allocation: on our old GPU ndarray, we were doing 2 allocations on the GPU, one for metadata and one for data. I removed this, as it was a bottleneck. Allocations on the CPU are faster than on the GPU, but they are still slow unless you reuse memory. Does PyMem_Malloc reuse previous small allocations?
Yes, at least in theory PyMem_Malloc is highly-optimized for small buffer re-use. (For requests >256 bytes it just calls malloc().) And it's possible to define type-specific freelists; not sure if there's any value in doing that for PyArrayObjects. See Objects/obmalloc.c in the Python source tree.
PyMem_Malloc is just a wrapper around malloc, so it's only as optimized as the C library is (glibc is not good for small allocations). PyObject_Malloc uses a small object allocator for requests smaller than 512 bytes (256 in Python 2).
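To make the distinction concrete, here is a minimal sketch of what "small object allocator" means in general: requests at or below a threshold are served from per-size-class free lists, and larger ones fall through to malloc(). This is not how Objects/obmalloc.c is actually implemented (the real allocator is considerably more elaborate), and small_alloc, small_free and the sized-free interface are hypothetical names invented for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define SMALL_THRESHOLD 512              /* threshold quoted above for Python 3 */
#define ALIGNMENT 8                      /* assumed size-class granularity */
#define NCLASSES (SMALL_THRESHOLD / ALIGNMENT)

/* A freed block stores the "next" pointer in its first bytes. */
_Static_assert(sizeof(void *) <= ALIGNMENT, "block too small for a pointer");

/* One singly linked free list per size class (hypothetical layout). */
static void *free_lists[NCLASSES];

/* Map 1..512 bytes to class 0..63; each class spans 8 bytes. */
static size_t size_class(size_t nbytes)
{
    return (nbytes + ALIGNMENT - 1) / ALIGNMENT - 1;
}

void *small_alloc(size_t nbytes)
{
    if (nbytes == 0)
        nbytes = 1;
    if (nbytes > SMALL_THRESHOLD)
        return malloc(nbytes);           /* large request: plain malloc */

    size_t c = size_class(nbytes);
    void *block = free_lists[c];
    if (block != NULL) {                 /* reuse a previously freed block */
        free_lists[c] = *(void **)block;
        return block;
    }
    return malloc((c + 1) * ALIGNMENT);  /* round up to the class size */
}

/* The caller passes the size back, which the real API does not require. */
void small_free(void *p, size_t nbytes)
{
    if (p == NULL)
        return;
    if (nbytes == 0)
        nbytes = 1;
    if (nbytes > SMALL_THRESHOLD) {
        free(p);
        return;
    }
    size_t c = size_class(nbytes);
    *(void **)p = free_lists[c];         /* push onto the class free list */
    free_lists[c] = p;
}
```

With this sketch, freeing a 64-byte block and then asking for 60 bytes returns the same block without going back to the C library, which is the kind of re-use the question above was asking about.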
Right, I meant PyObject_Malloc of course.
I filed a pull request [0] replacing a few functions which I think are safe to convert to this API: the nditer allocation, which is completely encapsulated, and the construction of the scalar and array Python objects, which are deleted via the tp_free slot (we really should not support third-party libraries using PyMem_Free on Python objects without checks).
This already gives up to a 15% improvement for scalar operations compared to glibc 2.17 malloc. Do I understand the discussion here correctly that we could replace PyDimMem_NEW, which is used for strides in PyArray, with the small object allocator too? It would still allow swapping the stride buffer, but every application would then have to delete it with PyDimMem_FREE, which should be a reasonable requirement.
That sounds reasonable to me. If we wanted to get even more elaborate, we could by default stick the shape/strides into the same allocation as the PyArrayObject, and then defer allocating a separate buffer until someone actually calls PyArray_Resize. (With a new flag, similar to OWNDATA, that tells us whether we need to free the shape/stride buffer when deallocating the array.) It's got to be a vanishingly small proportion of arrays where PyArray_Resize is actually called, so for most arrays, this would let us skip the allocation entirely, and the only cost would be that for arrays where PyArray_Resize *is* called to add new dimensions, we'd leave the original buffers sitting around until the array was freed, wasting a tiny amount of memory. Given that no-one has noticed that currently *every* array wastes 50% of this much memory (see upthread), I doubt anyone will care...

--
Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org
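The co-allocation idea in the last message could look roughly like the sketch below. It uses plain malloc() and a stand-in struct rather than the real PyArrayObject, and every name in it (SketchArray, SKETCH_OWNDIMS, sketch_new, sketch_set_dims, sketch_free) is hypothetical; it only illustrates the proposed layout and flag, not NumPy's actual implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical flag, similar in spirit to OWNDATA: set once dims/strides
   live in a separately allocated buffer that must be freed with the array. */
#define SKETCH_OWNDIMS 0x1

typedef struct {
    int nd;
    int flags;
    ptrdiff_t *dims;          /* nd shape entries */
    ptrdiff_t *strides;       /* nd stride entries */
    ptrdiff_t inline_dims[];  /* 2 * nd slots in the same allocation */
} SketchArray;

/* One allocation holds both the header and the shape/strides. */
SketchArray *sketch_new(int nd, const ptrdiff_t *shape, const ptrdiff_t *strides)
{
    assert(nd >= 0);
    size_t n = (size_t)nd;
    SketchArray *a = malloc(sizeof(SketchArray) + 2 * n * sizeof(ptrdiff_t));
    if (a == NULL)
        return NULL;
    a->nd = nd;
    a->flags = 0;
    a->dims = a->inline_dims;
    a->strides = a->inline_dims + n;
    if (n > 0) {
        memcpy(a->dims, shape, n * sizeof(ptrdiff_t));
        memcpy(a->strides, strides, n * sizeof(ptrdiff_t));
    }
    return a;
}

/* Resize path: only now is a separate buffer allocated. The inline slots
   are left unused until the array is freed (the small waste noted above). */
int sketch_set_dims(SketchArray *a, int nd, const ptrdiff_t *shape, const ptrdiff_t *strides)
{
    assert(nd >= 0);
    size_t n = (size_t)nd;
    ptrdiff_t *buf = malloc((n > 0 ? 2 * n : 1) * sizeof(ptrdiff_t));
    if (buf == NULL)
        return -1;
    if (n > 0) {
        memcpy(buf, shape, n * sizeof(ptrdiff_t));
        memcpy(buf + n, strides, n * sizeof(ptrdiff_t));
    }
    if (a->flags & SKETCH_OWNDIMS)
        free(a->dims);        /* drop a buffer from an earlier resize */
    a->dims = buf;
    a->strides = buf + n;
    a->nd = nd;
    a->flags |= SKETCH_OWNDIMS;
    return 0;
}

void sketch_free(SketchArray *a)
{
    if (a == NULL)
        return;
    if (a->flags & SKETCH_OWNDIMS)
        free(a->dims);        /* strides share the same buffer */
    free(a);
}
```

In the common case where the dimensions never change, creating and destroying an array costs a single malloc()/free() pair; the extra buffer and the flag only come into play on the resize path.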