Proposal to accept NEP 49: Data allocation strategies
Here is the current rendering of the NEP: https://numpy.org/neps/nep-0049.html

The mailing list discussion, started on April 20, did not bring up any objections to the proposal, nor were there objections in the discussion around the text of the NEP. There were questions around details of the implementation; thank you, reviewers, for carefully looking at them and suggesting improvements.

If there are no substantive objections within 7 days from this email, then the NEP will be accepted; see NEP 0 for more details.

Matti
The NEP looks good, but I worry the API isn't flexible enough. My two main concerns are:

### Stateful allocators

Consider an allocator that aligns to `N` bytes, where `N` is configurable from a python call in someone else's extension module. Where do they store `N`? They can hide it in `PyDataMem_Handler::name`, but that's obviously an abuse of the API. They can store it as a global variable, but then the idea of tracking the allocator used to construct an array doesn't work, as the state ends up changing with the global allocator.

The easy way out here would be to add a `void *context` field to the structure, and pass it into all the methods. This doesn't really solve the problem though, as now there's no way to clean up any allocations used to populate `context`, or worse, to decrement references to python objects stored within `context`.

I think we want to bundle `PyDataMem_Handler` in a `PyObject` somehow, either via a new C type, or by using the PyCapsule API, which has the cleanup and state hooks we need. `PyDataMem_GetHandlerName` would then return this PyObject rather than an opaque name.

For a more exotic case, consider a file-backed allocator that is constructed from a python `mmap` object and manages blocks within that mmap. The allocator needs to keep a reference to the `mmap` object alive until all the arrays allocated within it are gone, but probably shouldn't leak a reference to it either.

### Thread and async-local allocators

For tracing purposes, I expect it to be valuable to be able to configure the allocator within a single thread / coroutine. If we want to support this, we'd most likely want to work with the PEP 567 ContextVar API rather than a half-baked thread_local solution that doesn't work for async code.

This problem isn't as pressing as the statefulness problem. Fixing it would amount to extending the `PyDataMem_SetHandler` API, and would be unlikely to break any code written against the current version of the NEP, meaning it would be fine to leave as a follow-up. It might still be worth remarking upon as future work of some kind in the NEP.

Eric
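(To make the PyCapsule idea above concrete, here is a minimal sketch, not part of the NEP: the capsule owns the allocator state and frees it in its destructor, which is exactly the cleanup hook a bare `void *context` field would lack. The names `aligned_state` and `make_aligned_handler_capsule` are hypothetical.)

```C
#include <Python.h>
#include <stdlib.h>

typedef struct {
    size_t alignment;  /* the configurable N from the example above */
} aligned_state;

/* Called when the last reference to the capsule is dropped. */
static void aligned_handler_destructor(PyObject *capsule)
{
    free(PyCapsule_GetPointer(capsule, "aligned_handler"));
}

static PyObject *make_aligned_handler_capsule(size_t alignment)
{
    aligned_state *state = malloc(sizeof(*state));
    if (state == NULL) {
        return PyErr_NoMemory();
    }
    state->alignment = alignment;
    /* If each array held a reference to this capsule, the state would
       stay alive exactly as long as some array still needs it. */
    return PyCapsule_New(state, "aligned_handler", aligned_handler_destructor);
}
```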
On 6/5/21 2:07 pm, Eric Wieser wrote:
The NEP looks good, but I worry the API isn't flexible enough. My two main concerns are:
### Stateful allocators
Consider an allocator that aligns to `N` bytes, where `N` is configurable from a python call in someone else's extension module. ...
### Thread and async-local allocators
For tracing purposes, I expect it to be valuable to be able to configure the allocator within a single thread / coroutine. If we want to support this, we'd most likely want to work with the PEP567 ContextVar API rather than a half-baked thread_local solution that doesn't work for async code.
This problem isn't as pressing as the statefulness problem. Fixing it would amount to extending the `PyDataMem_SetHandler` API, and would be unlikely to break any code written against the current version of the NEP; meaning it would be fine to leave as a follow-up. It might still be worth remarking upon as future work of some kind in the NEP.
I would prefer to leave both of these to a future extension of the NEP. Setting the alignment from a python-level call seems to be asking for trouble, and I would need to be convinced that the extra layer of flexibility is worth it.

It might be worth mentioning that this NEP may be extended in the future, but truthfully I think that is the case for all NEPs.

Matti
Another argument for supporting stateful allocators would be compatibility with the stateful C++11 allocator API, such as https://en.cppreference.com/w/cpp/memory/allocator_traits/allocate. Adding support for stateful allocators at a later date would almost certainly create an ABI breakage, or lots of pain around avoiding one.

I haven't thought very much about the PyCapsule approach (although it appears some other reviewers on github considered it at one point), but even building it from scratch, the overhead to support statefulness is not large. As I demonstrate on the github issue (18805), it would amount to changing the API from:

```C
// the version in the NEP
typedef void *(PyDataMem_AllocFunc)(size_t size);
typedef void *(PyDataMem_ZeroedAllocFunc)(size_t nelems, size_t elsize);
typedef void (PyDataMem_FreeFunc)(void *ptr, size_t size);
typedef void *(PyDataMem_ReallocFunc)(void *ptr, size_t size);

typedef struct {
    char name[200];
    PyDataMem_AllocFunc *alloc;
    PyDataMem_ZeroedAllocFunc *zeroed_alloc;
    PyDataMem_FreeFunc *free;
    PyDataMem_ReallocFunc *realloc;
} PyDataMem_Handler;

const PyDataMem_Handler *PyDataMem_SetHandler(PyDataMem_Handler *handler);
const char *PyDataMem_GetHandlerName(PyArrayObject *obj);
```

to

```C
// proposed changes: a `PyObject *self` argument pointing to a
// `PyDataMem_HandlerObject`, and a `PyObject_HEAD`
typedef void *(PyDataMem_AllocFunc)(PyObject *self, size_t size);
typedef void *(PyDataMem_ZeroedAllocFunc)(PyObject *self, size_t nelems, size_t elsize);
typedef void (PyDataMem_FreeFunc)(PyObject *self, void *ptr, size_t size);
typedef void *(PyDataMem_ReallocFunc)(PyObject *self, void *ptr, size_t size);

typedef struct {
    PyObject_HEAD
    PyDataMem_AllocFunc *alloc;
    PyDataMem_ZeroedAllocFunc *zeroed_alloc;
    PyDataMem_FreeFunc *free;
    PyDataMem_ReallocFunc *realloc;
} PyDataMem_HandlerObject;

// steals a reference to handler; the caller is responsible for decrefing the result
PyDataMem_HandlerObject *PyDataMem_SetHandler(PyDataMem_HandlerObject *handler);
// returns a borrowed reference
PyDataMem_HandlerObject *PyDataMem_GetHandler(PyArrayObject *obj);

// some boilerplate that numpy is already full of, and which doesn't impact
// users of non-stateful allocators
PyTypeObject PyDataMem_HandlerType = ...;
```

When constructing an array, the reference count of the handler would be incremented before storing it in the array struct.

Since the extra work now to support this is not awful, but the potential for ABI headaches down the road is, I think we should aim to support statefulness right from the start. The runtime overhead of the stateful approach above vs the NEP approach is negligible, and consists of:

* Some overhead cost for setting up an allocator. This likely only happens near startup, so won't matter.
* An extra incref on each array allocation.
* An extra pointer argument on the stack for each allocation and deallocation.
* Perhaps around 32 extra bytes per allocator object. Since arrays just store pointers to allocators, this doesn't matter.

Eric
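(To illustrate what the `self` argument buys: a rough sketch, assuming the proposed `PyObject`-based handler above, of a single alloc function serving any alignment. `AlignedHandler` and `aligned_alloc_func` are hypothetical, and POSIX `posix_memalign` stands in for whatever aligned allocation primitive is appropriate.)

```C
#include <Python.h>
#include <stdlib.h>

/* A handler with extra per-instance state: laid out like the proposed
   PyDataMem_HandlerObject, with an alignment field appended. */
typedef struct {
    PyObject_HEAD
    PyDataMem_AllocFunc *alloc;
    PyDataMem_ZeroedAllocFunc *zeroed_alloc;
    PyDataMem_FreeFunc *free;
    PyDataMem_ReallocFunc *realloc;
    size_t alignment;   /* per-instance state */
} AlignedHandler;

static void *aligned_alloc_func(PyObject *self, size_t size)
{
    /* One compiled function serves 16-, 64-, and 4096-byte-aligned
       handlers alike; the alignment travels with the handler object.
       posix_memalign requires a power-of-two multiple of sizeof(void *). */
    size_t align = ((AlignedHandler *)self)->alignment;
    void *p = NULL;
    if (posix_memalign(&p, align, size) != 0) {
        return NULL;
    }
    return p;
}
```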
On Thu, 2021-05-06 at 13:06 +0100, Eric Wieser wrote:
Another argument for supporting stateful allocators would be compatibility with the stateful C++11 allocator API, such as https://en.cppreference.com/w/cpp/memory/allocator_traits/allocate.
The Python version of this does have a `void *ctx`, but I am not sure if the use for this is actually valuable for the NumPy use-cases. (Honestly, beyond aligned allocation or memory pinning, I am uncertain what those use-cases are.)

I had more written, but maybe just keep it short: While I like the `PyObject *` idea, I am also not sure that it helps much. If we want allocation-specific state, the user should overallocate and save it before the actual allocation.

I am sure there could be extensions in the future (although I don't know what exactly). I am not super worried about it; it's fairly niche, and we can probably figure out ways to deprecate an old way of registration and slowly replace it with a new way.

But if we don't mind the churn it creates, the only serious idea I would have right now is using a `FromSpec` API. The only difference would be that we allocate the struct and (for now) return something that is fully opaque (we could allow get/set functions on it though). In fact, we could even keep the current struct largely unchanged but change it to be the main "spec", with no actual slots currently necessary (could even be a `void *slots` that is always NULL). (Slots are a bit unfortunate, since they cast to `void *`, making compile-time type checking harder, but overall I think it's OK and something we will be using more anyway for DTypes.)

I am not sure it is worth it, but if there are no arguments why we cannot allocate the struct, that seems fine. If the return value is opaque, we even have the ability to turn it into a proper Python object if we want to.

Cheers,

Sebastian
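(A hedged sketch of one reading of the `FromSpec` idea; none of these names exist, they are purely illustrative.)

```C
/* The user fills in a spec; NumPy allocates the real (opaque) handler
   itself, so the internal layout can grow without an ABI break. */
typedef struct {
    char name[200];
    PyDataMem_AllocFunc *alloc;
    PyDataMem_ZeroedAllocFunc *zeroed_alloc;
    PyDataMem_FreeFunc *free;
    PyDataMem_ReallocFunc *realloc;
    void *slots;   /* always NULL for now; reserved for future extension */
} PyDataMem_HandlerSpec;

/* Copies the spec into a NumPy-owned struct; returns NULL on error,
   which also gives a natural path for deprecating old registrations. */
PyDataMem_Handler *PyDataMem_HandlerFromSpec(const PyDataMem_HandlerSpec *spec);
```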
The Python version of this does have a `void *ctx`, but I am not sure if the use for this is actually valuable for the NumPy use-cases.
Do you mean "the CPython version"? If so, can you link a reference?
While I like the `PyObject *` idea, I am also not sure that it helps much. If we want allocation specific state, the user should overallocate and save it before the actual allocation.
I was talking about allocator- not allocation- specific state. I agree that the correct place to store the latter is by overallocating, but it doesn't make much sense to me to duplicate state about the allocator itself in each allocation.
But if we don't mind the churn it creates, the only serious idea I would have right now is using a `FromSpec` API.

We could allow get/set functions on it though.

We don't even need to go as far as a flexible `FromSpec` API. Simply having a function to allocate (and free) the opaque struct and a handful of getters ought to be enough to let us change the allocator to be stateful in future. On the other hand, this is probably about as much work as just making it a PyObject in the first place.

Eric
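(For concreteness, the sort of surface that paragraph implies, with entirely hypothetical names: allocate, free, and a handful of setters/getters on an otherwise opaque struct.)

```C
/* All hypothetical: the struct stays opaque, so fields (state included)
   can be added later without recompiling user code. */
PyDataMem_Handler *PyDataMem_HandlerNew(const char *name);
void PyDataMem_HandlerFree(PyDataMem_Handler *handler);
int PyDataMem_HandlerSetAlloc(PyDataMem_Handler *handler,
                              PyDataMem_AllocFunc *alloc);
PyDataMem_AllocFunc *PyDataMem_HandlerGetAlloc(const PyDataMem_Handler *handler);
```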
On Mon, 2021-05-10 at 10:01 +0100, Eric Wieser wrote:
The Python version of this does have a `void *ctx`, but I am not sure if the use for this is actually valuable for the NumPy use-cases.
Do you mean "the CPython version"? If so, can you link a reference?
Yes, sorry, it had been a while since I had looked it up: https://docs.python.org/3/c-api/memory.html#c.PyMemAllocatorEx

That all looks like it can be customized in theory. But I am not sure that it is practical, except for hooking and calling the previous one. (But we also have tracemalloc anyway?) I have to say it feels a bit like exposing things publicly that are really mainly used internally, but not sure... Presumably Python uses the `ctx` for something, though.
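(For reference, the struct behind that link, copied from the CPython docs; the user-supplied `ctx` pointer is passed as the first argument of every call.)

```C
typedef struct {
    /* user context passed as first argument to the four functions */
    void *ctx;
    void *(*malloc)(void *ctx, size_t size);
    void *(*calloc)(void *ctx, size_t nelem, size_t elsize);
    void *(*realloc)(void *ctx, void *ptr, size_t new_size);
    void (*free)(void *ctx, void *ptr);
} PyMemAllocatorEx;
```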
While I like the `PyObject *` idea, I am also not sure that it helps much. If we want allocation specific state, the user should overallocate and save it before the actual allocation.
I was talking about allocator- not allocation- specific state. I agree that the correct place to store the latter is by overallocating, but it doesn't make much sense to me to duplicate state about the allocator itself in each allocation.
Right, I don't really know a use-case right now. But I am fine with saying: let's pass in some state anyway, to future-proof. Although if we ensure that the API can be extended, even that is probably not really necessary, unless we have a faint idea how it would be used? (I guess the C++ similarity may be a reason, but I am not familiar with that.)
But if we don't mind the churn it creates, the only serious idea I would have right now is using a `FromSpec` API. We could allow get/set functions on it though
We don't even need to go as far as a flexible `FromSpec` API. Simply having a function to allocate (and free) the opaque struct and a handful of getters ought to be enough to let us change the allocator to be stateful in future. On the other hand, this is probably about as much work as just making it a PyObject in the first place.
Yeah, if we don't expect things to grow often/much, we can just use what we have now and either add a `NULL` argument at the end and/or just make a new function when we need it. The important part would be returning a new struct.

I think even opaque is not necessary! If we return the new struct, we can extend it freely and return NULL to indicate an error (thus being able to deprecate if we have to). Right now we don't even have getters in the proposal IIRC, so that part probably just doesn't matter either. (If we want to allow falling back to the previous allocator, this would have to be expanded.)

I agree that `PyObject *` is probably just as well if you want the struct to be free'able, since then you suddenly need reference counting or similar! But right now the proposal says this is static, and I honestly don't see much reason for it to be freeable? The current use-cases `cupy` or `pnumpy` do not seem to need it.

If we return a new struct (I do not care if opaque or not), all of that can still be expanded. Should we just do that? Or can we think of any downside to that, or a use-case where this is clearly too limiting right now?

Cheers,

Sebastian
On 10/5/21 8:43 pm, Sebastian Berg wrote:
But right now the proposal says this is static, and I honestly don't see much reason for it to be freeable?
I think this is the crux of the issue. The current design is for a singly-allocated struct to be passed around, since it is just an aggregate of functions. If someone wants a different strategy (e.g. a different alignment), they create a new policy: there are no additional parameters or data associated with the struct. I don't really see an ask from possible users for anything more, and so would prefer to remain with the simplest possible design.

If the need arises in the future for additional data, which is doubtful, I am confident we can expand this as needed, and do not want to burden the current design with unneeded optional features. It would be nice to hear from some actual users whether they need the flexibility.

In any case, I would like to resolve this quickly and get it into the next release, so if Eric is adamant that the advanced design is needed, I will accept his proposal, since that seems easier than any of the alternatives so far.

Matti
Yes, sorry, had been a while since I had looked it up:
https://docs.python.org/3/c-api/memory.html#c.PyMemAllocatorEx
That `PyMemAllocatorEx` looks almost exactly like one of the two variants I was proposing. Is there a reason for wanting to define our own structure vs just using that one? I think the NEP should at least offer a brief comparison to that structure, even if we ultimately end up not using it.
That all looks like it can be customized in theory. But I am not sure that it is practical, except for hooking and calling the previous one.
Is chaining allocators not likely something we want to support too? For instance, an allocator that is used for large arrays, but falls back to the previous one for small arrays?
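(A sketch of that chaining pattern against the NEP draft's signatures; `previous` would be the handler returned by `PyDataMem_SetHandler` at registration, and `huge_page_alloc` is a hypothetical stand-in for the large-array path.)

```C
static const PyDataMem_Handler *previous;  /* saved when registering */

#define LARGE_THRESHOLD (1 << 20)  /* 1 MiB; arbitrary cutoff */

static void *chained_alloc(size_t size)
{
    if (size < LARGE_THRESHOLD) {
        /* Small request: delegate to whatever allocator was installed
           before ours. */
        return previous->alloc(size);
    }
    return huge_page_alloc(size);  /* hypothetical large-array allocator */
}
```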
I have to say it feels a bit like exposing things publicly, that are really mainly used internally, but not sure... Presumably Python uses the `ctx` for something though.
I'd argue `ctx` / `baton` / `user_data` arguments are an essential part of any C callback API. I can't find any particularly good reference for this right now, but I have been bitten multiple times by C APIs that forget to add this argument.
If someone wants a different strategy (e.g. a different alignment), they create a new policy

The crux of the problem here is that without very nasty hacks, C and C++ do not allow new functions to be created at runtime. This makes it very awkward to write a parameterizable allocator. If you want to create two aligned allocators with different alignments, and you don't have a `ctx` argument to plumb through that alignment information, you're forced to write the entire thing twice.

I guess the C++ similarity may be a reason, but I am not familiar with that.

Similarity isn't the only motivation - I was considering compatibility. Consider a user who's already written a shiny stateful C++ allocator, and wants to use it with numpy. I've made a gist at https://gist.github.com/eric-wieser/6d0fde53fc1ba7a2fa4ac208467f2ae5 which demonstrates how to hook an arbitrary C++ allocator into this new numpy allocator API, and which compares both the NEP version and the version with an added `ctx` argument. The NEP version has a bug that is very hard to fix without duplicating the entire `numpy_handler_from_cpp_allocator` function.

If compatibility with C++ seems too much of a stretch, the NEP API is not even compatible with `PyMemAllocatorEx`.
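(A sketch of the plumbing described above: with a `ctx` argument the aligned allocator is written once, while without it each distinct alignment needs its own function definition. Names here are hypothetical.)

```C
#include <stdlib.h>

/* One body for every alignment: the alignment arrives through ctx. */
static void *ctx_aligned_alloc(void *ctx, size_t size)
{
    size_t alignment = *(const size_t *)ctx;
    void *p = NULL;
    return posix_memalign(&p, alignment, size) == 0 ? p : NULL;
}

/* Two allocators, zero duplicated code: register the same function
   with &align16 or &align64 as its ctx. */
static size_t align16 = 16;
static size_t align64 = 64;
```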
But right now the proposal says this is static, and I honestly don't see much reason for it to be freeable? The current use-cases `cupy` or `pnumpy` do not seem to need it.
I don't know much about either of these use cases, so the following is speculative. In cupy, presumably the application is to tie allocation to a specific GPU device. Presumably then, somewhere in the python code there is a handle to a GPU object, through which the allocators operate. If that handle is stored in the allocator, and the allocator is freeable, then it is possible to write code that automatically releases the GPU handle after the allocator has been restored to the default and the last array using it is cleaned up.

If that cupy use-case seems somewhat plausible, then I think we should go with the PyObject approach. If it doesn't seem plausible, then I think the `ctx` approach is acceptable, and we should consider declaring our struct `struct { PyMemAllocatorEx allocator; char const *name; }` to reuse the existing python API, unless there's a reason not to.

Eric
On Tue, 2021-05-11 at 09:54 +0100, Eric Wieser wrote:
Yes, sorry, had been a while since I had looked it up:
https://docs.python.org/3/c-api/memory.html#c.PyMemAllocatorEx
That `PyMemAllocatorEx` looks almost exactly like one of the two variants I was proposing. Is there a reason for wanting to define our own structure vs just using that one? I think the NEP should at least offer a brief comparison to that structure, even if we ultimately end up not using it.
That all looks like it can be customized in theory. But I am not sure that it is practical, except for hooking and calling the previous one.
Is chaining allocators not likely something we want to support too? For instance, an allocator that is used for large arrays, but falls back to the previous one for small arrays?
I have to say it feels a bit like exposing things publicly, that are really mainly used internally, but not sure... Presumably Python uses the `ctx` for something though.
I'd argue `ctx` / `baton` / `user_data` arguments are an essential part of any C callback API. I can't find any particularly good reference for this right now, but I have been bitten multiple times by C APIs that forget to add this argument.
Can't argue with that :). I am personally still mostly a bit concerned that we have some way to modify/extend in the future (even clunky seems fine). Beyond that, I don't care all that much. Passing a context feels right to me, but neither do I know that we need it. Using PyObject still feels a bit much, but I am also not opposed.

I guess for future extension, we would have to subclass ourselves and/or include an ABI version number (if just to avoid `PyObject_TypeCheck` calls to figure out which ABI version we got). Otherwise, either allocating the struct or including a version number (or reserved space) in the struct/PyObject is probably good enough to ensure we have a path for modifying/extending the ABI.

I hope that the actual end-users can chip in and clear it up a bit...

Cheers,

Sebastian
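(One possible shape for the version-number idea, entirely hypothetical: a leading ABI tag plus reserved space, so NumPy can tell which layout it was handed without a `PyObject_TypeCheck`.)

```C
#include <stdint.h>

typedef struct {
    uint8_t abi_version;   /* bumped whenever the layout below changes */
    char reserved[64];     /* spare room for future fields */
    PyDataMem_AllocFunc *alloc;
    PyDataMem_ZeroedAllocFunc *zeroed_alloc;
    PyDataMem_FreeFunc *free;
    PyDataMem_ReallocFunc *realloc;
} PyDataMem_Handler;
```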
Eric Wieser wrote:
Yes, sorry, had been a while since I had looked it up:
https://docs.python.org/3/c-api/memory.html#c.PyMemAllocatorEx
That `PyMemAllocatorEx` looks almost exactly like one of the two variants I was proposing. Is there a reason for wanting to define our own structure vs just using that one? I think the NEP should at least offer a brief comparison to that structure, even if we ultimately end up not using it.
Agreed.

Eric Wieser wrote:
But right now the proposal says this is static, and I honestly don't see much reason for it to be freeable? The current use-cases `cupy` or `pnumpy` don't not seem to need it.
I don't know much about either of these use cases, so the following is speculative. In cupy, presumably the application is to tie allocation to a specific GPU device. Presumably then, somewhere in the python code there is a handle to a GPU object, through which the allocators operate. If that handle is stored in the allocator, and the allocator is freeable, then it is possible to write code that automatically releases the GPU handle after the allocator has been restored to the default and the last array using it is cleaned up.
If that cupy use-case seems somwhat plausible, then I think we should go with the PyObject approach. If it doesn't seem plausible, then I think the `ctx` approach is acceptable, and we should consider declaring our struct ```struct { PyMemAllocatorEx allocator; char const *name; }``` to reuse the existing python API unless there's a reason not to.
Coming in as a CuPy contributor here. The discussion of using this new NEP is not yet finalized in CuPy, so I am only speaking for the potential usage that I conceived.

The original idea of using a custom NumPy allocator in CuPy (or any GPU library) is to allocate pinned / page-locked memory, which is on the host (CPU). The idea is to exploit the fact that device-host transfer is faster when pinned memory is in use. So, if I am calling `arr_cpu = cupy.asnumpy(arr_gpu)` to create a NumPy array and make a D2H transfer, and if I know `arr_cpu`'s buffer is going to be reused several times, then it's better for it to be backed by pinned memory from the beginning. While there are tricks to achieve this, such a use pattern can be quite common in user code, so it's much easier if the allocator is configurable, to avoid repeating boilerplate.

An interesting note: this new custom allocator could also be used to allocate managed/unified memory from CUDA. This memory lives in a unified address space so that both CPU and GPU can access it. I do not have much to say about this use case, however.

Now, I am not fully sure we need `void* ctx` or even to make it a `PyObject`. My understanding (correct me if I am wrong!) is that the allocator state is considered internal. Imagine I set `alloc` in `PyDataMem_Handler` to be `alloc_my_mempool`, which has access to the internals of a memory pool class that manages a pool of pinned memory. Then whatever information is needed can just be kept inside my mempool (including alignment, pool size, etc). I could implement the pool as a C++ class, and expose the alloc/free/etc member functions to C with some hacks. If using Cython, I suppose it's less hacky to expose a method of a cdef class. On the other hand, for pure C code life is probably easier if ctx is there. One way or another someone must keep a unique instance of that struct or class alive, so I do not have a strong opinion.

Best,
Leo
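(A sketch of the pinned-memory allocator Leo describes, against the NEP draft's slot signatures; `pinned_alloc`/`pinned_free` are hypothetical names, and the CUDA runtime calls are the standard `cudaMallocHost`/`cudaFreeHost`.)

```C
#include <cuda_runtime.h>
#include <stddef.h>

/* Page-locked host memory: D2H/H2D copies from such buffers avoid an
   extra staging copy, which is the speedup mentioned above. */
static void *pinned_alloc(size_t size)
{
    void *p = NULL;
    if (cudaMallocHost(&p, size) != cudaSuccess) {
        return NULL;
    }
    return p;
}

static void pinned_free(void *ptr, size_t size)
{
    (void)size;  /* unused; the CUDA runtime tracks the allocation size */
    cudaFreeHost(ptr);
}
```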
Eric Wieser wrote
Yes, sorry, had been a while since I had looked it up:
https://docs.python.org/3/c-api/memory.html#c.PyMemAllocatorEx
That `PyMemAllocatorEx` looks almost exactly like one of the two variants I was proposing. Is there a reason for wanting to define our own structure vs just using that one? I think the NEP should at least offer a brief comparison to that structure, even if we ultimately end up not using it.
I have to say it feels a bit like exposing things publicly, that are really mainly used internally, but not sure... Presumably Python uses the `ctx` for something though.
I'd argue `ctx` / `baton` / `user_data` arguments are an essential part of any C callback API. I can't find any particularly good reference for this right now, but I have been bitten multiple times by C APIs that forget to add this argument.
If someone wants a different strategy (i.e. different alignment) they create a new policy
The crux of the problem here is that without very nasty hacks, C and C++ do not allow new functions to be created at runtime. This makes it very awkward to write a parameterizable allocator. If you want to create two aligned allocators with different alignments, and you don't have a `ctx` argument to plumb through that alignment information, you're forced to write the entire thing twice.
The `PyMemAllocatorEx` memory API will allow (lambda) closure-like definition of the data mem routines. That's the main idea behind the `ctx` thing; it's huge and will enable every allocation scenario. In my opinion, the rest of the proposals (PyObjects, PyCapsules, etc.) are secondary and could be considered out-of-scope. I would suggest to let people use this before hiding it behind a strict API.

Let me also give you an insight into how we plan to do it, since we are the first to integrate this in production code. Considering this NEP as a primitive API, I developed a new project to address our requirements:

1. Provide a Python-native way to define a new numpy allocator
2. Accept data mem routine symbols (function pointers) from open dynamic libraries
3. Allow local-scoped allocation, e.g. inside a `with` statement

But since there was not much fun in these, I thought it would be nice if we could exploit `ctypes` callback functions, to allow developers to hook into such routines natively (e.g. for debugging/monitoring), or even write them entirely in Python (of course there has to be an underlying memory allocation API). For example, the idea is to be able to define a page-aligned allocator in ~30 lines of Python code, like this: https://github.com/inaccel/numpy-allocator/blob/master/test/aligned_allocato...

---

While experimenting with this project I spotted the two following issues:

1. Thread-locality

My biggest concern is the global scope of the numpy `current_allocator` variable. Currently, an allocator change is applied globally, affecting every thread. This behavior breaks the local-scoped allocation promise of my project. Imagine, for example, the implications of allocating pinned (page-locked) memory (since you mention this use-case a lot) for random glue-code ndarrays in background threads.

2. Allocator context (already discussed)

I found a bug when I tried to use a Python callback (`ctypes.CFUNCTION`) for the `PyDataMem_FreeFunc` routine. Since there are cases in which the `free` routine is invoked after a PyErr has occurred (to clean up internal arrays, for example), `ctypes` messes with the exception state badly. This problem can be resolved with the use of a `ctx` (allocator context) that will allow the routines to run clean of errors, wrapping them like this:

```C
static void wrapped_free(void *ptr, size_t size, void *ctx)
{
    PyObject *type;
    PyObject *value;
    PyObject *traceback;
    PyErr_Fetch(&type, &value, &traceback);
    ((PyDataMem_Context *) ctx)->free(ptr, size);
    PyErr_Restore(type, value, traceback);
}
```

Note: This bug doesn't affect `CDLL` members (CFuncPtr objects), since they are pure `dlsym` pointers.

Of course, this is a simple case of how a `ctx` could be useful for an allocation policy. I guess people can become very creative with this in general.

Elias
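(The `PyDataMem_Context` type in the snippet above is not shown anywhere in the thread; a minimal guess at its shape, purely illustrative:)

```C
/* Hypothetical: the wrapped routines the context would carry, of which
   the snippet above only uses `free`. */
typedef struct {
    void *(*alloc)(size_t size);
    void (*free)(void *ptr, size_t size);
} PyDataMem_Context;
```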
Note that PEP 445, which introduced `PyMemAllocatorEx`, specifically rejected omitting the `ctx` argument: https://www.python.org/dev/peps/pep-0445/#id23; that is another argument in favor of having it.

I'll try to give a more thorough justification for the pyobject / capsule suggestion in another message in the next few days.
The NEP [0] and the corresponding PR [1] have gone through another round of editing. I would like to restart the discussion here if anyone has more to add. Things that have changed since the last round:

- The functions now accept a context argument
- The code has been cleaned up for consistency
- The language of the NEP has been tightened

Thanks to all who have contributed to the discussion so far.

Matti

[0] https://numpy.org/neps/nep-0049.html
[1] https://github.com/numpy/numpy/pull/17582
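(The rendered NEP [0] is authoritative for the final signatures; as a sketch of what "accept a context argument" means in the `PyMemAllocatorEx` style discussed above, with the struct name hypothetical:)

```C
/* Illustrative only; see the NEP [0] for the actual struct. Each slot
   receives the handler's ctx pointer, mirroring PyMemAllocatorEx. */
typedef struct {
    void *ctx;
    void *(*malloc)(void *ctx, size_t size);
    void *(*calloc)(void *ctx, size_t nelem, size_t elsize);
    void *(*realloc)(void *ctx, void *ptr, size_t new_size);
    void (*free)(void *ctx, void *ptr, size_t size);
} PyDataMemAllocator;  /* hypothetical name */
```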
participants (5): eliaskoromilas, Eric Wieser, leofang, Matti Picus, Sebastian Berg