[Numpy-discussion] Proposal to accept NEP 49: Data allocation strategies

leofang leo80042 at gmail.com
Wed May 12 16:56:59 EDT 2021


Eric Wieser wrote
>> Yes, sorry, had been a while since I had looked it up:
>>
>> https://docs.python.org/3/c-api/memory.html#c.PyMemAllocatorEx
> 
> That `PyMemAllocatorEx` looks almost exactly like one of the two variants
> I was proposing. Is there a reason for wanting to define our own structure
> vs just using that one?
> I think the NEP should at least offer a brief comparison to that
> structure, even if we ultimately end up not using it.

Agreed.
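
For anyone reading along, the struct at that link is (copied from the
CPython docs):

```
typedef struct {
    /* user context passed as the first argument to the 4 functions */
    void *ctx;

    /* allocate a memory block */
    void* (*malloc) (void *ctx, size_t size);

    /* allocate a memory block initialized by zeros */
    void* (*calloc) (void *ctx, size_t nelem, size_t elsize);

    /* allocate or resize a memory block */
    void* (*realloc) (void *ctx, void *ptr, size_t new_size);

    /* release a memory block */
    void (*free) (void *ctx, void *ptr);
} PyMemAllocatorEx;
```

Note that it already threads a `void *ctx` through every hook, which is
exactly the point being debated below.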


Eric Wieser wrote
>> But right now the proposal says this is static, and I honestly don't
>> see much reason for it to be freeable?  The current use-cases `cupy` or
>> `pnumpy` don't seem to need it.
> 
> I don't know much about either of these use cases, so the following is
> speculative.
> In cupy, presumably the application is to tie allocation to a specific GPU
> device.
> Presumably then, somewhere in the python code there is a handle to a GPU
> object, through which the allocators operate.
> If that handle is stored in the allocator, and the allocator is freeable,
> then it is possible to write code that automatically releases the GPU
> handle after the allocator has been restored to the default and the last
> array using it is cleaned up.
> 
> If that cupy use-case seems somewhat plausible, then I think we should go
> with the PyObject approach.
> If it doesn't seem plausible, then I think the `ctx` approach is
> acceptable, and we should consider declaring our struct
> ```struct { PyMemAllocatorEx allocator; char const *name; }``` to reuse
> the
> existing python API unless there's a reason not to.

Coming in here as a CuPy contributor. The discussion of adopting this new
NEP has not been finalized in CuPy, so I am only speaking about the
potential usage that I have in mind.

The original idea of using a custom NumPy allocator in CuPy (or any GPU
library) is to allocate pinned / page-locked memory, which lives on the
host (CPU). The point is to exploit the fact that device-to-host transfers
are faster when pinned memory is in use. So, if I am calling

arr_cpu = cupy.asnumpy(arr_gpu)

to create a NumPy array and make a D2H transfer, and if I know arr_cpu's
buffer is going to be reused several times, then it is better for it to be
backed by pinned memory from the beginning. While there are tricks to
achieve this, such a use pattern can be quite common in user code, so it is
much easier if the allocator can be configured, to avoid repeating the same
boilerplate everywhere.
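
To make that concrete, below is a rough sketch (in C, against the CUDA
runtime API) of what the malloc/calloc/free hooks of such a pinned-memory
allocator could look like. The function names are mine, and I give the
hooks a `ctx` parameter in the `PyMemAllocatorEx` style purely for
illustration; none of this is the NEP's final interface.

```
/* Hypothetical pinned-memory hooks for the proposed allocator API.
 * The names and ctx-style signatures are illustrative; only the CUDA
 * runtime calls (cudaMallocHost / cudaFreeHost) are real. */
#include <cuda_runtime.h>
#include <stddef.h>
#include <string.h>

static void *pinned_malloc(void *ctx, size_t size)
{
    void *p = NULL;
    (void)ctx;  /* this sketch keeps no per-allocator state */
    if (cudaMallocHost(&p, size) != cudaSuccess) {
        return NULL;
    }
    return p;
}

static void *pinned_calloc(void *ctx, size_t nelem, size_t elsize)
{
    /* no overflow check on nelem * elsize, for brevity */
    void *p = pinned_malloc(ctx, nelem * elsize);
    if (p != NULL) {
        memset(p, 0, nelem * elsize);
    }
    return p;
}

static void pinned_free(void *ctx, void *ptr)
{
    (void)ctx;
    cudaFreeHost(ptr);
}
```

With hooks like these registered (through whatever registration call the
NEP ends up exposing), the arr_cpu returned by cupy.asnumpy above would be
pinned from the start, with no extra user code.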

An interesting note: this new custom allocator could also be used to
allocate managed/unified memory from CUDA. This memory lives in a unified
address space, so both CPU and GPU can access it. I do not have much to say
about this use case, however.

Now, I am not entirely sure we need `void* ctx`, or even to make it a
`PyObject`. My understanding (correct me if I am wrong!) is that the
allocator state is considered internal. Imagine I set `alloc` in
`PyDataMem_Handler` to `alloc_my_mempool`, which has access to the
internals of a memory pool class that manages a pool of pinned memory.
Then whatever information is needed can simply be kept inside my mempool
(including alignment, pool size, etc.). I could implement the pool as a
C++ class and expose the alloc/free/etc. member functions to C with some
hacks. If using Cython, I suppose it is less hacky to expose a method of a
cdef class. On the other hand, for pure C code life is probably easier if
ctx is there. One way or another, someone must keep a unique instance of
that struct or class alive, so I do not have a strong opinion.
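
As a minimal sketch of the "state stays internal" alternative: the hooks
below reach a single pool instance kept at file scope, so nothing has to be
threaded through a `ctx` pointer. The pool type and the `free_my_mempool`
name are made up, and plain malloc/free stand in for a real pinned-memory
pool.

```
#include <stdlib.h>
#include <stddef.h>

typedef struct {
    size_t alignment;    /* pool-wide settings live here...                */
    size_t live_blocks;  /* ...as does whatever bookkeeping the pool wants */
} my_pinned_pool;

/* One unique instance, kept alive for the lifetime of the module. */
static my_pinned_pool pool = { .alignment = 64, .live_blocks = 0 };

static void *alloc_my_mempool(size_t size)
{
    /* No ctx argument: the hook simply closes over the file-static pool. */
    pool.live_blocks += 1;
    return malloc(size);  /* a real pool would hand out pinned memory */
}

static void free_my_mempool(void *ptr)
{
    pool.live_blocks -= 1;
    free(ptr);
}
```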


Best,
Leo





