On 18.07.2013 15:36, Nathaniel Smith wrote:
On Wed, Jul 17, 2013 at 5:57 PM, Frédéric Bastien firstname.lastname@example.org wrote:
On Wed, Jul 17, 2013 at 10:39 AM, Nathaniel Smith email@example.com wrote:
On Tue, Jul 16, 2013 at 11:55 AM, Nathaniel Smith firstname.lastname@example.org wrote:
It's entirely possible I misunderstood, so let's see if we can work it out. I know that you want to assign to the ->data pointer in a PyArrayObject, right? That's what caused some trouble with the 1.7 API deprecations, which were trying to prevent direct access to this field? Creating a new array given a pointer to a memory region is no problem, and obviously will be supported regardless of any optimizations. But if that's all you were doing then you shouldn't have run into the deprecation problem. Or maybe I'm misremembering!
What is currently done in only one place is to create a new PyArrayObject with a given ptr, so NumPy doesn't do the allocation. We later change that ptr to another one.
Hmm, OK, so that would still work. If the array has the OWNDATA flag set (or you otherwise know where the data came from), then swapping the data pointer would still work.
The change would be that in most cases when asking numpy to allocate a new array from scratch, the OWNDATA flag would not be set. That's because the OWNDATA flag really means "when this object is deallocated, call free(self->data)", but if we allocate the array struct and the data buffer together in a single memory region, then deallocating the object will automatically cause the data buffer to be deallocated as well, without the array destructor having to take any special effort.
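To make the difference concrete, here is a minimal sketch in plain C. It is not NumPy's implementation: `toy_array`, `toy_new_inline`, `toy_new_separate`, and `toy_free` are hypothetical names, and the `owndata` field only mimics the role of the OWNDATA flag described above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for an array object; NOT the real PyArrayObject. */
typedef struct {
    char *data;                 /* pointer to the first element */
    size_t nbytes;              /* size of the data buffer in bytes */
    int owndata;                /* 1: destructor must call free(data) */
    max_align_t inline_data[];  /* storage that trails the header */
} toy_array;

/* One malloc for header and data: the data sits right after the header. */
toy_array *toy_new_inline(size_t nbytes) {
    toy_array *a = malloc(sizeof(toy_array) + nbytes);
    if (a == NULL) return NULL;
    a->data = (char *)a->inline_data;
    a->nbytes = nbytes;
    a->owndata = 0;  /* freeing the header releases the data as well */
    return a;
}

/* Two mallocs: the data buffer is separate, so the header owns it. */
toy_array *toy_new_separate(size_t nbytes) {
    toy_array *a = malloc(sizeof(toy_array));
    if (a == NULL) return NULL;
    a->data = malloc(nbytes ? nbytes : 1);
    if (a->data == NULL) { free(a); return NULL; }
    a->nbytes = nbytes;
    a->owndata = 1;
    return a;
}

/* The destructor only needs extra work when the data is owned separately. */
void toy_free(toy_array *a) {
    if (a == NULL) return;
    if (a->owndata) free(a->data);
    free(a);
}
```

In this sketch, swapping `data` on an object from `toy_new_separate` stays safe as long as the caller knows where the old buffer came from, which matches the point above about knowing the origin of the data.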
It is the change to the ptr of the just-created PyArrayObject that caused problems with the interface deprecation. I fixed all the other problems related to the deprecation (mostly just renamed functions/macros), but I haven't fixed this one yet. I would need to change the logic to compute the final ptr before creating the PyArrayObject and then create it with the final data ptr. But in all cases, NumPy didn't allocate the data memory for this object, so this case doesn't block your optimization.
One thing on our optimization "wish list" is to reuse allocated PyArrayObjects between Theano function calls for intermediate results (so completely under Theano's control). This could be useful in particular for reshape/transpose/subtensor. Those functions are pretty fast, and from memory, I had already found that the allocation time was significant. But in those cases, the PyArrayObjects are views, so the metadata and the data would be in different memory regions in all cases.
The other case on the optimization "wish list" is reusing the PyArrayObject when the shape isn't the right one (but the number of dimensions is the same). If we do that for operations like addition, we will need to use PyArray_Resize(). This will be done on PyArrayObjects whose data memory was allocated by NumPy. So if you do a single memory allocation for metadata and data, just make sure that PyArray_Resize() handles that correctly.
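The resize concern can be sketched as follows. This is an assumption about one way a combined allocation could be handled, not a description of what PyArray_Resize() does today; `toy_array`, `toy_new_inline`, `toy_resize`, and `toy_free` are hypothetical names. The idea is that growing an inline buffer moves the data into a separate allocation and turns the ownership flag on, while the object's own address never changes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for an array object; NOT the real PyArrayObject. */
typedef struct {
    char *data;
    size_t nbytes;
    int owndata;                /* 1: destructor must call free(data) */
    max_align_t inline_data[];  /* storage that trails the header */
} toy_array;

toy_array *toy_new_inline(size_t nbytes) {
    toy_array *a = malloc(sizeof(toy_array) + nbytes);
    if (a == NULL) return NULL;
    a->data = (char *)a->inline_data;
    a->nbytes = nbytes;
    a->owndata = 0;
    return a;
}

/* Returns 0 on success, -1 on failure or when the data is a foreign view. */
int toy_resize(toy_array *a, size_t new_nbytes) {
    size_t alloc = new_nbytes ? new_nbytes : 1;
    if (a->owndata) {
        /* Separate buffer: plain realloc works. */
        char *p = realloc(a->data, alloc);
        if (p == NULL) return -1;
        a->data = p;
    } else if (a->data == (char *)a->inline_data) {
        /* Inline buffer: shrinking stays in place; growing must move the
         * data out, because the object itself cannot be reallocated. */
        if (new_nbytes > a->nbytes) {
            char *p = malloc(alloc);
            if (p == NULL) return -1;
            memcpy(p, a->data, a->nbytes);
            a->data = p;
            a->owndata = 1;  /* from now on the destructor frees data */
        }
    } else {
        return -1;  /* a view onto memory owned by someone else */
    }
    a->nbytes = new_nbytes;
    return 0;
}

void toy_free(toy_array *a) {
    if (a == NULL) return;
    if (a->owndata) free(a->data);
    free(a);
}
```

The cost of this design is that the trailing inline region stays unused after a grow until the object itself is freed.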
I'm not sure I follow the details here, but it does turn out that a really surprising amount of time in PyArray_NewFromDescr is spent in just calculating and writing out the shape and strides buffers, so for programs that e.g. use hundreds of small 3-element arrays to represent points in space, re-using even these buffers might be a big win...
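For readers unfamiliar with that bookkeeping, the sketch below shows the kind of per-array work involved: deriving C-contiguous strides from a shape. It is an illustration only, not the code in PyArray_NewFromDescr, and `toy_fill_strides` is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>

/* Fill strides (in bytes) for a C-contiguous array of the given shape.
 * The last axis varies fastest, so its stride is the item size. */
void toy_fill_strides(int ndim, const size_t *shape, size_t itemsize,
                      size_t *strides) {
    size_t step = itemsize;
    for (int i = ndim - 1; i >= 0; i--) {
        strides[i] = step;
        step *= shape[i];
    }
}
```

For a shape of (2, 3, 4) with 8-byte items this yields strides of (96, 32, 8). An object that is reused with the same shape could keep these values instead of recomputing and rewriting them.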
On the usefulness of doing only one memory allocation: in our old GPU ndarray, we were doing two allocations on the GPU, one for metadata and one for data. I removed this, as it was a bottleneck. Allocations on the CPU are faster than on the GPU, but they are still slow unless you reuse memory. Does PyMem_Malloc reuse previous small allocations?
Yes, at least in theory PyMem_Malloc is highly-optimized for small buffer re-use. (For requests >256 bytes it just calls malloc().) And it's possible to define type-specific freelists; not sure if there's any value in doing that for PyArrayObjects. See Objects/obmalloc.c in the Python source tree.
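A type-specific freelist of the kind mentioned above could look roughly like this. It is a generic sketch in plain C, not the allocator in Objects/obmalloc.c; `toy_small_alloc`, `toy_small_free`, and the two size constants are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define TOY_BLOCK_SIZE 64   /* assumed upper bound for "small" requests */
#define TOY_MAX_CACHED 16   /* assumed cap on cached free blocks */

/* A free block stores the link to the next free block in its own memory. */
typedef union toy_block {
    union toy_block *next;
    max_align_t align;
    char payload[TOY_BLOCK_SIZE];
} toy_block;

static toy_block *toy_freelist = NULL;
static size_t toy_cached = 0;

/* Small requests are served from the freelist when possible. */
void *toy_small_alloc(size_t n) {
    if (n > TOY_BLOCK_SIZE) return malloc(n);
    if (toy_freelist != NULL) {
        toy_block *b = toy_freelist;
        toy_freelist = b->next;
        toy_cached--;
        return b;
    }
    return malloc(sizeof(toy_block));
}

/* The caller passes the original request size so the block can be routed
 * back to the freelist or to free(). */
void toy_small_free(void *p, size_t n) {
    if (p == NULL) return;
    if (n > TOY_BLOCK_SIZE || toy_cached >= TOY_MAX_CACHED) {
        free(p);
        return;
    }
    toy_block *b = p;
    b->next = toy_freelist;
    toy_freelist = b;
    toy_cached++;
}
```

A block released with `toy_small_free` is handed back by the next small `toy_small_alloc` call without touching malloc, which is the re-use pattern the question above is asking about. This sketch is not thread-safe.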
PyMem_Malloc is just a wrapper around malloc, so it's only as optimized as the C library is (glibc is not good for small allocations). PyObject_Malloc uses a small object allocator for requests smaller than 512 bytes (256 in Python 2).
I filed a pull request replacing a few functions which I think are safe to convert to this API: the nditer allocation, which is completely encapsulated, and the construction of the scalar and array Python objects, which are deleted via the tp_free slot (we really should not support third-party libraries using PyMem_Free on Python objects without checks).
This already gives up to 15% improvement for scalar operations compared to glibc 2.17 malloc. Do I understand the discussion here correctly that we could also replace PyDimMem_NEW, which is used for strides in PyArray, with the small object allocator? It would still allow swapping the stride buffer, but every application would then have to delete it with PyDimMem_FREE, which should be a reasonable requirement.