Eric Wieser wrote:
Yes, sorry, had been a while since I had looked it up:
https://docs.python.org/3/c-api/memory.html#c.PyMemAllocatorEx
That `PyMemAllocatorEx` looks almost exactly like one of the two variants I was proposing. Is there a reason for wanting to define our own structure vs just using that one? I think the NEP should at least offer a brief comparison to that structure, even if we ultimately end up not using it.
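For reference, that structure as documented at the link above looks roughly like this (copied from the CPython docs):

```c
typedef struct {
    /* user context passed as first argument to the four functions */
    void *ctx;

    /* allocate a memory block */
    void* (*malloc)(void *ctx, size_t size);

    /* allocate a memory block initialized with zeros */
    void* (*calloc)(void *ctx, size_t nelem, size_t elsize);

    /* allocate or resize a memory block */
    void* (*realloc)(void *ctx, void *ptr, size_t new_size);

    /* release a memory block */
    void (*free)(void *ctx, void *ptr);
} PyMemAllocatorEx;
```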
Agreed, a comparison belongs in the NEP.

Eric Wieser wrote:
But right now the proposal says this is static, and I honestly don't see much reason for it to be freeable? The current use-cases, `cupy` and `pnumpy`, don't seem to need it.
I don't know much about either of these use cases, so the following is speculative. In cupy, presumably the application is to tie allocation to a specific GPU device. Presumably then, somewhere in the Python code there is a handle to a GPU object through which the allocators operate. If that handle is stored in the allocator, and the allocator is freeable, then it is possible to write code that automatically releases the GPU handle once the allocator has been restored to the default and the last array using it has been cleaned up.
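To make the "freeable" idea concrete, here is a rough sketch; everything in it other than the PyCapsule API is hypothetical and only for illustration. The handler lives in a capsule, and the capsule destructor, which runs once the last reference is dropped (e.g. by the last array allocated through it), tears down the GPU handle:

```c
#include <Python.h>
#include <stdlib.h>

/* Hypothetical GPU handle and teardown; stand-ins for whatever the GPU
 * library actually provides. */
typedef struct { int device_id; } gpu_handle;

static void release_gpu_handle(gpu_handle *handle)
{
    (void)handle;  /* e.g. return buffers to the driver, close streams, ... */
}

/* Hypothetical handler object: the handle is stored in the allocator state
 * so the alloc/free functions can reach it. */
typedef struct {
    gpu_handle *handle;
    /* ... allocation function pointers (e.g. a PyMemAllocatorEx) go here ... */
} my_handler;

/* Runs when the last reference to the capsule dies, i.e. after the default
 * allocator has been restored and the last array allocated through this
 * handler has been collected (assuming each such array keeps a reference). */
static void my_handler_destructor(PyObject *capsule)
{
    my_handler *h = PyCapsule_GetPointer(capsule, "my_handler");
    if (h != NULL) {
        release_gpu_handle(h->handle);
        free(h);
    }
}

static PyObject *make_handler_object(gpu_handle *handle)
{
    my_handler *h = malloc(sizeof(*h));
    if (h == NULL) {
        return PyErr_NoMemory();
    }
    h->handle = handle;
    return PyCapsule_New(h, "my_handler", my_handler_destructor);
}
```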
If that cupy use-case seems somewhat plausible, then I think we should go with the PyObject approach. If it doesn't seem plausible, then I think the `ctx` approach is acceptable, and we should consider declaring our struct as ```struct { PyMemAllocatorEx allocator; char const *name; }``` to reuse the existing Python API, unless there's a reason not to.
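Spelled out, that wrapping struct would look something like this (the struct name below is just a placeholder, not anything the NEP currently defines):

```c
#include <Python.h>  /* PyMemAllocatorEx */

/* Placeholder name for a NumPy handler struct that reuses the CPython
 * allocator type instead of defining a new set of function pointers. */
typedef struct {
    PyMemAllocatorEx allocator;  /* ctx + malloc/calloc/realloc/free */
    char const *name;            /* human-readable name, e.g. for error messages */
} numpy_mem_handler;
```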
Coming in as a CuPy contributor here. The discussion of whether to use this new NEP is not yet finalized in CuPy, so I am only speaking for the potential usage that I have conceived.

The original idea of using a custom NumPy allocator in CuPy (or any GPU library) is to allocate pinned / page-locked memory, which lives on the host (CPU). The idea is to exploit the fact that device-host transfers are faster when pinned memory is in use. So, if I am calling `arr_cpu = cupy.asnumpy(arr_gpu)` to create a NumPy array and make a D2H transfer, and if I know `arr_cpu`'s buffer is going to be reused several times, then it's better for it to be backed by pinned memory from the beginning. While there are tricks to achieve this, such a use pattern can be quite common in user code, so it's much easier if the allocator is configurable, to avoid repeating boilerplate.

An interesting note: this new custom allocator can also be used to allocate managed/unified memory from CUDA. This memory lives in a unified address space, so both CPU and GPU can access it. I do not have much to say about this use case, however.

Now, I am not fully sure we need `void* ctx`, or even to make it a `PyObject`. My understanding (correct me if I am wrong!) is that the allocator state is considered internal. Imagine I set `alloc` in `PyDataMem_Handler` to `alloc_my_mempool`, which has access to the internals of a memory pool class that manages a pool of pinned memory. Then whatever information is needed can just be kept inside my mempool (including alignment, pool size, etc). I could implement the pool as a C++ class and expose the alloc/free/etc member functions to C with some hacks. If using Cython, I suppose it's less hacky to expose a method of a cdef class. On the other hand, for pure C code life is probably easier if `ctx` is there. One way or another, someone must keep a unique instance of that struct or class alive, so I do not have a strong opinion.

Best,
Leo
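P.S. For concreteness, here is a minimal sketch of what the pinned-memory alloc/free pair could look like on the C side, using the CUDA runtime API. The `ctx`-style signatures just follow the shape of `PyMemAllocatorEx` quoted earlier in the thread, not whatever the final NEP struct ends up using, and the calloc/realloc counterparts are omitted:

```c
#include <stddef.h>
#include <cuda_runtime_api.h>

/* Allocate page-locked (pinned) host memory so that later device<->host
 * copies of the NumPy buffer can take the faster DMA path. */
static void *pinned_malloc(void *ctx, size_t size)
{
    void *ptr = NULL;
    (void)ctx;  /* a pool implementation would keep its state here, or in a
                   static/global as discussed above */
    if (cudaHostAlloc(&ptr, size, cudaHostAllocDefault) != cudaSuccess) {
        return NULL;
    }
    return ptr;
}

static void pinned_free(void *ctx, void *ptr)
{
    (void)ctx;
    cudaFreeHost(ptr);
}
```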