Speedup by avoiding double memory allocation for scalar arrays
Hi,

I am working on performance parity between numpy scalars/small arrays and Python scalars, as a GSoC project mentored by Charles.

Currently I am looking at PyArray_Return, which allocates separate memory just for the scalar return. Unlike Python, which allocates memory once when returning the result of a scalar operation, numpy calls malloc twice: once for the array object itself, and a second time for the array data. These memory allocations happen in PyArray_NewFromDescr and PyArray_Scalar. Stashing both within a single allocation would be more efficient.

In PyArray_Scalar, a new struct (PyLongScalarObject) needs to be allocated in the case of scalar arrays. Instead, can we somehow convert/cast the PyArrayObject to a PyLongScalarObject?

--
Arink Verma
www.arinkverma.in
On 16 Jul 2013 11:35, "Arink Verma" <arinkverma@gmail.com> wrote:
Hi,
I am working on performance parity between numpy scalars/small arrays and Python scalars, as a GSoC project mentored by Charles.
Currently I am looking at PyArray_Return, which allocates separate memory just for the scalar return. Unlike Python, which allocates memory once when returning the result of a scalar operation, numpy calls malloc twice: once for the array object itself, and a second time for the array data.
These memory allocations happen in PyArray_NewFromDescr and PyArray_Scalar. Stashing both within a single allocation would be more efficient.
In PyArray_Scalar, a new struct (PyLongScalarObject) needs to be allocated in the case of scalar arrays. Instead, can we somehow convert/cast the PyArrayObject to a PyLongScalarObject?
I think there are more than 2 mallocs you're talking about here?

Each ndarray does two mallocs, for the obj and buffer. These could be combined into 1 - just allocate the total size and do some pointer arithmetic, then set OWNDATA to false.

Converting an array to a scalar does more allocations. I doubt there's a way to avoid these, but can't say for sure (on my phone now). In any case the idea of the project is to make scalars obsolete by making arrays competitive, right? So no need to go optimizing the competition ;-). (And more seriously, this slowdown *only* exists because of the array/scalar split, so ignoring it is fair.)

In the bigger picture, these are pretty tiny optimizations, aren't they? In the quick profiling I did a while ago, it looked like there was a lot of much bigger low-hanging fruit, and fiddling around with one malloc versus two isn't going to do much if we're still wasting an order of magnitude more time in inefficient loop selection and unnecessary writes to the FP control word?

-n
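To make the "combine into 1" idea concrete, here is a minimal illustrative sketch (a stand-in struct with invented names, not NumPy's actual allocator): one malloc covers both the object header and the data buffer, the data pointer is computed by pointer arithmetic into the same block, and OWNDATA stays false so the destructor never frees the data separately.

    /* Minimal sketch of the combined-allocation idea; fake_array is a
     * stand-in for PyArrayObject, not real NumPy code. */
    #include <stdlib.h>
    #include <string.h>

    typedef struct {
        char *data;    /* stand-in for PyArrayObject's ->data */
        int owndata;   /* stand-in for the OWNDATA flag */
    } fake_array;

    static fake_array *fake_array_new(size_t nbytes)
    {
        /* one allocation for object + buffer; round the header up so
         * the data that follows stays 16-byte aligned */
        size_t header = (sizeof(fake_array) + 15) & ~(size_t)15;
        fake_array *a = malloc(header + nbytes);
        if (a == NULL)
            return NULL;
        a->data = (char *)a + header;  /* points into the same block */
        a->owndata = 0;                /* freeing the object frees the data */
        memset(a->data, 0, nbytes);
        return a;
    }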
Each ndarray does two mallocs, for the obj and buffer. These could be combined into 1 - just allocate the total size and do some pointer arithmetic, then set OWNDATA to false.

So those two mallocs have been mentioned in the project introduction. I got that wrong.
magnitude more time in inefficient loop selection and unnecessary writes to the FP control word?

Loop selection contributes around 2-3% of the time. I implemented a cache with PyThreadState_GetDict(), but it didn't help. Even generating a prepopulated dict/list in code_generator/generate_umath.py is not helping.
Here is the distribution of time for addition operations. All the memory-related and BuildValue operations cost more than 7%; the looping ones are around 2-3%:

- PyUFunc_AdditionTypeResolver (7.6%)
- SimpleBinaryOperationTypeResolver (6.2%)
- execute_legacy_ufunc_loop (20.7%)
- trivial_three_operand_loop (8.6%); this will be around 3.4% when PR #3521 (https://github.com/numpy/numpy/pull/3521) gets merged
- PyArray_NewFromDescr (7.3%)
- PyUFunc_DefaultLegacyInnerLoopSelector (2.5%)
- PyUFunc_GetPyValues (12.0%)
- _extract_pyvals (9.2%)
- PyArray_Return (14.3%)

--
Arink Verma
www.arinkverma.in
On Tue, Jul 16, 2013 at 2:34 PM, Arink Verma <arinkverma@gmail.com> wrote:
Each ndarray does two mallocs, for the obj and buffer. These could be combined into 1 - just allocate the total size and do some pointer arithmetic, then set OWNDATA to false.

So those two mallocs have been mentioned in the project introduction. I got that wrong.
On further thought/reading the code, it appears to be more complicated than that, actually.

It looks like (for a non-scalar array) we have 2 calls to PyMem_Malloc: 1 for the array object itself, and one for the shapes + strides. And, one call to regular-old malloc: for the data buffer.

(Mysteriously, shapes + strides together have 2*ndim elements, but to hold them we allocate a memory region sized to hold 3*ndim elements. I'm not sure why.)

And contrary to what I said earlier, this is about as optimized as it can be without breaking ABI. We need at least 2 calls to malloc/PyMem_Malloc, because the shapes+strides may need to be resized without affecting the much larger data area. But it's tempting to allocate the array object and the data buffer in a single memory region, like I suggested earlier. And this would ALMOST work. But, it turns out there is code out there which assumes (whether wisely or not) that you can swap around which data buffer a given PyArrayObject refers to (hi Theano!). And supporting this means that data buffers and PyArrayObjects need to be in separate memory regions.
magnitude more time in inefficient loop selection and unnecessary writes to the FP control word?

Loop selection contributes around 2-3% of the time. I implemented a cache with PyThreadState_GetDict(), but it didn't help. Even generating a prepopulated dict/list in code_generator/generate_umath.py is not helping.

Here is the distribution of time for addition operations. All the memory-related and BuildValue operations cost more than 7%; the looping ones are around 2-3%:

- PyUFunc_AdditionTypeResolver (7.6%)
- SimpleBinaryOperationTypeResolver (6.2%)
- execute_legacy_ufunc_loop (20.7%)
- trivial_three_operand_loop (8.6%); this will be around 3.4% when PR #3521 (https://github.com/numpy/numpy/pull/3521) gets merged
- PyArray_NewFromDescr (7.3%)
- PyUFunc_DefaultLegacyInnerLoopSelector (2.5%)
- PyUFunc_GetPyValues (12.0%)
- _extract_pyvals (9.2%)
- PyArray_Return (14.3%)
Hmm, you prodded me into running those numbers again to see :-)
At http://www.arinkverma.in/2013/06/finding-bottleneck-in-pythonnumpy.html you say that you're using a Python compiled with --with-pydebug. Is this true? If so then stop! You want numpy compiled with generic debugging information ("-g" on gcc), and maybe it helps to have Python compiled with "-g" as well. But --with-pydebug goes much further -- it actually changes the Python interpreter in many ways to add lots of expensive self-checks. On my machine simple operations like "[]" (allocate a list) or "1.0 + 1.0" go about 4x slower when I use Ubuntu's python-dbg package (which is compiled with --with-pydebug). You can't trust speed measurements you get from a --with-pydebug build.

Anyway, I'm using 64-bit python2.7 from Ubuntu's repo, self-compiled numpy master, with this measurement code:

    import ctypes
    profiler = ctypes.CDLL("libprofiler.so.0")

    def loop(n):
        import numpy as np
        print "Numpy:", np.__version__
        x = np.asarray([1.0, 2.0])
        for i in xrange(n):
            x + x

    profiler.ProfilerStart("/tmp/master-array-float64-add.prof")
    loop(10000000)
    profiler.ProfilerStop()

Graph attached. Notice:

- Because my benchmark has a 2-element array instead of a scalar array, the special-case scalar return logic (PyArray_Return etc.) disappears. This makes all percentages a bit higher in my graph, because the operation is overall faster.

- PyArray_NewFromDescr does indeed take 11.6% of the time, but it's not clear why. Half that time is directly inside PyArray_NewFromDescr, not in any sub-calls to malloc-related functions. Also, you see a lot more time in array_alloc than I do, which may be caused by --with-pydebug. Taking a closer look with google-pprof --disasm=PyArray_NewFromDescr (also attached), it looks like the major cost here is, bizarrely enough, the calculation of the array size?! Out of 338 cumulative samples in this function, I count 175 that are associated with various div/mul instructions, while all the mallocs together take only 164 (= 5.6% of total time). This is pretty bizarre for a bunch of 1-dimensional 2-element arrays!?

- PyUFunc_AdditionTypeResolver takes 10.9% of the time, and PyUFunc_DefaultLegacyInnerLoopSelector takes another 4.2% of the time, and this is pretty absurd considering that we're talking about locating the float64 + float64 loop, which should not require any complicated logic. This should be like 0.1% or something. I'm not surprised that PyThreadState_GetDict() doesn't help -- doing dict lookups was probably more expensive than the thing you replaced! But some sort of simple table lookup scheme that reduces loop lookup to chasing a few pointers should be totally doable.

- We're spending 13.6% of the time in PyUFunc_getfperr. I'm pretty sure that a lot of this is totally wasted time, because we implement both 'set' and 'clear' operations as 'set+clear', making them twice as costly as necessary. (Eventually it would be even better if we could disable this logic entirely for integer arrays, and for when the user has turned off fp error reporting. But neither of these would help for this simple float+float benchmark.)

- _extract_pyvals and PyUFunc_GetPyValues (not sure why they aren't linked in my graph, but they seem to be the same code) together use >11% of the time. This is also completely silly -- all this time is spent on doing elaborate stuff to look up entries in a python dict, extract them, and convert them into, like, some C-level bitmasks. And then doing that again and again on every operation.
Instead we should convert this stuff to C values once, when they're set in the first place, and stash those C values directly into a thread-local variable. See PyThread_*_key in pythread.h for a raw TLS implementation that's always available (and which is what PyThreadState_GetDict() is built on top of). The documentation is in the Python source distribution in comments in Python/thread.c.

-n
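As a rough sketch of what that could look like (hypothetical cache struct and field names, not NumPy code; a real version would create the key once at module init rather than lazily):

    #include "Python.h"
    #include "pythread.h"

    /* Hypothetical cached form of the np.seterr() state: plain C
     * values stashed in raw TLS instead of re-parsed from a Python
     * dict on every ufunc call. */
    typedef struct {
        int errmask;
        PyObject *errobj;
    } ufunc_pyvals_cache;

    static int pyvals_key = -1;

    static ufunc_pyvals_cache *get_cached_pyvals(void)
    {
        ufunc_pyvals_cache *c;
        if (pyvals_key == -1)
            pyvals_key = PyThread_create_key();
        c = PyThread_get_key_value(pyvals_key);
        if (c == NULL) {
            c = PyMem_Malloc(sizeof(*c));
            if (c == NULL)
                return NULL;
            c->errmask = 0;     /* filled in whenever np.seterr() runs */
            c->errobj = NULL;
            PyThread_set_key_value(pyvals_key, c);
        }
        return c;  /* hot path: one TLS lookup, no dict traffic */
    }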
Hi, On Tue, Jul 16, 2013 at 11:55 AM, Nathaniel Smith <njs@pobox.com> wrote:
On Tue, Jul 16, 2013 at 2:34 PM, Arink Verma <arinkverma@gmail.com> wrote:
Each ndarray does two mallocs, for the obj and buffer. These could be combined into 1 - just allocate the total size and do some pointer arithmetic, then set OWNDATA to false.

So those two mallocs have been mentioned in the project introduction. I got that wrong.
On further thought/reading the code, it appears to be more complicated than that, actually.
It looks like (for a non-scalar array) we have 2 calls to PyMem_Malloc: 1 for the array object itself, and one for the shapes + strides. And, one call to regular-old malloc: for the data buffer.
(Mysteriously, shapes + strides together have 2*ndim elements, but to hold them we allocate a memory region sized to hold 3*ndim elements. I'm not sure why.)
And contrary to what I said earlier, this is about as optimized as it can be without breaking ABI. We need at least 2 calls to malloc/PyMem_Malloc, because the shapes+strides may need to be resized without affecting the much larger data area. But it's tempting to allocate the array object and the data buffer in a single memory region, like I suggested earlier. And this would ALMOST work. But, it turns out there is code out there which assumes (whether wisely or not) that you can swap around which data buffer a given PyArrayObject refers to (hi Theano!). And supporting this means that data buffers and PyArrayObjects need to be in separate memory regions.
Are you sure that Theano "swaps" the data ptr of an ndarray? When we play with that, it is on a newly created ndarray, so a node in our graph won't change the input ndarray structure. It will create a new ndarray structure with new shape/strides and pass a data ptr, and we flag the new ndarray with own_data correctly, to my knowledge.

If Theano poses a problem here, I'll suggest that I fix Theano. But currently I don't see the problem. So if this makes you change your mind about this optimization, tell me. I don't want Theano to prevent optimization in NumPy.

Fred
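For reference, the pattern described above -- wrapping an existing buffer so NumPy never allocates the data -- looks roughly like this sketch (wrap_buffer is an invented helper name; assumes the usual NumPy C-API setup with import_array() already done):

    #include <Python.h>
    #include <numpy/arrayobject.h>

    static PyObject *wrap_buffer(double *buf, npy_intp n)
    {
        npy_intp dims[1] = {n};
        /* PyArray_NewFromDescr steals the descr reference. OWNDATA is
         * not set, so the caller stays responsible for freeing buf. */
        return PyArray_NewFromDescr(&PyArray_Type,
                                    PyArray_DescrFromType(NPY_DOUBLE),
                                    1, dims,
                                    NULL,  /* C-contiguous strides */
                                    buf,   /* pre-existing data pointer */
                                    NPY_ARRAY_CARRAY, NULL);
    }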
On Tue, Jul 16, 2013 at 7:53 PM, Frédéric Bastien <nouiz@nouiz.org> wrote:
Hi,
On Tue, Jul 16, 2013 at 11:55 AM, Nathaniel Smith <njs@pobox.com> wrote:
On Tue, Jul 16, 2013 at 2:34 PM, Arink Verma <arinkverma@gmail.com> wrote:
Each ndarray does two mallocs, for the obj and buffer. These could be combined into 1 - just allocate the total size and do some pointer arithmetic, then set OWNDATA to false.

So those two mallocs have been mentioned in the project introduction. I got that wrong.
On further thought/reading the code, it appears to be more complicated than that, actually.
It looks like (for a non-scalar array) we have 2 calls to PyMem_Malloc: 1 for the array object itself, and one for the shapes + strides. And, one call to regular-old malloc: for the data buffer.
(Mysteriously, shapes + strides together have 2*ndim elements, but to hold them we allocate a memory region sized to hold 3*ndim elements. I'm not sure why.)
And contrary to what I said earlier, this is about as optimized as it can be without breaking ABI. We need at least 2 calls to malloc/PyMem_Malloc, because the shapes+strides may need to be resized without affecting the much larger data area. But it's tempting to allocate the array object and the data buffer in a single memory region, like I suggested earlier. And this would ALMOST work. But, it turns out there is code out there which assumes (whether wisely or not) that you can swap around which data buffer a given PyArrayObject refers to (hi Theano!). And supporting this means that data buffers and PyArrayObjects need to be in separate memory regions.
Are you sure that Theano "swaps" the data ptr of an ndarray? When we play with that, it is on a newly created ndarray, so a node in our graph won't change the input ndarray structure. It will create a new ndarray structure with new shape/strides and pass a data ptr, and we flag the new ndarray with own_data correctly, to my knowledge.

If Theano poses a problem here, I'll suggest that I fix Theano. But currently I don't see the problem. So if this makes you change your mind about this optimization, tell me. I don't want Theano to prevent optimization in NumPy.
It's entirely possible I misunderstood, so let's see if we can work it out. I know that you want to assign to the ->data pointer in a PyArrayObject, right? That's what caused some trouble with the 1.7 API deprecations, which were trying to prevent direct access to this field? Creating a new array given a pointer to a memory region is no problem, and obviously will be supported regardless of any optimizations. But if that's all you were doing then you shouldn't have run into the deprecation problem. Or maybe I'm misremembering!

The problem is if one wants to (a) create a PyArrayObject, which will by default allocate a new memory region and assign a pointer to it to the ->data field, and *then* (b) "steal" that memory region and replace it with another one, while keeping the same PyArrayObject. This is technically possible right now (though I wouldn't say it was necessarily a good idea!), but it would become impossible if we allocated the PyArrayObject and data into a single region.

The profiles suggest that this would only make allocation of arrays maybe 15% faster, with probably a similar effect on deallocation. And I'm not sure how often array allocation per se is actually a bottleneck -- usually you also do things with the arrays, which is more expensive :-). But hey, 15% is nothing to sneeze at.

-n
On Wed, Jul 17, 2013 at 10:39 AM, Nathaniel Smith <njs@pobox.com> wrote:
Hi,
On Tue, Jul 16, 2013 at 11:55 AM, Nathaniel Smith <njs@pobox.com> wrote:
On Tue, Jul 16, 2013 at 2:34 PM, Arink Verma <arinkverma@gmail.com>
wrote:
Each ndarray does two mallocs, for the obj and buffer. These could be combined into 1 - just allocate the total size and do some pointer arithmetic, then set OWNDATA to false.

So those two mallocs have been mentioned in the project introduction. I got that wrong.
On further thought/reading the code, it appears to be more complicated than that, actually.
It looks like (for a non-scalar array) we have 2 calls to PyMem_Malloc: 1 for the array object itself, and one for the shapes + strides. And, one call to regular-old malloc: for the data buffer.
(Mysteriously, shapes + strides together have 2*ndim elements, but to hold them we allocate a memory region sized to hold 3*ndim elements. I'm not sure why.)
And contrary to what I said earlier, this is about as optimized as it can be without breaking ABI. We need at least 2 calls to malloc/PyMem_Malloc, because the shapes+strides may need to be resized without affecting the much larger data area. But it's tempting to allocate the array object and the data buffer in a single memory region, like I suggested earlier. And this would ALMOST work. But, it turns out there is code out there which assumes (whether wisely or not) that you can swap around which data buffer a given PyArrayObject refers to (hi Theano!). And supporting this means that data buffers and PyArrayObjects need to be in separate memory regions.

On Tue, Jul 16, 2013 at 7:53 PM, Frédéric Bastien <nouiz@nouiz.org> wrote:
Are you sure that Theano "swaps" the data ptr of an ndarray? When we play with that, it is on a newly created ndarray, so a node in our graph won't change the input ndarray structure. It will create a new ndarray structure with new shape/strides and pass a data ptr, and we flag the new ndarray with own_data correctly, to my knowledge.

If Theano poses a problem here, I'll suggest that I fix Theano. But currently I don't see the problem. So if this makes you change your mind about this optimization, tell me. I don't want Theano to prevent optimization in NumPy.
It's entirely possible I misunderstood, so let's see if we can work it out. I know that you want to assign to the ->data pointer in a PyArrayObject, right? That's what caused some trouble with the 1.7 API deprecations, which were trying to prevent direct access to this field? Creating a new array given a pointer to a memory region is no problem, and obviously will be supported regardless of any optimizations. But if that's all you were doing then you shouldn't have run into the deprecation problem. Or maybe I'm misremembering!
What is currently done in only one place is to create a new PyArrayObject with a given ptr, so NumPy doesn't do the allocation. We later change that ptr to another one.

It is the change to the ptr of the just-created PyArrayObject that caused problems with the interface deprecation. I fixed all the other problems related to the deprecation (mostly just renames of functions/macros), but I didn't fix this one yet. I would need to change the logic to compute the final ptr before creating the PyArrayObject and create it with the final data ptr. But in all cases, NumPy didn't allocate the data memory for this object, so this case doesn't block your optimization.

One thing on our optimization "wish list" is to reuse allocated PyArrayObjects between Theano function calls for intermediate results (so completely under Theano's control). This could be useful in particular for reshape/transpose/subtensor. Those functions are pretty fast, and from memory I already found the allocation time to be significant. But in those cases it is on PyArrayObjects that are views, so the metadata and the data would be in different memory regions in all cases.

The other case on the optimization "wish list" is reusing the PyArrayObject when the shape isn't the right one (but the number of dimensions is the same). If we do that for operations like addition, we will need to use PyArray_Resize(). This will be done on PyArrayObjects whose data memory was allocated by NumPy. So if you do one memory allocation for metadata and data, just make sure that PyArray_Resize() will handle that correctly.

On the usefulness of doing only one memory allocation: on our old gpu ndarray, we were doing two allocs on the GPU, one for metadata and one for data. I removed this, as it was a bottleneck. Allocation on the CPU is faster than on the GPU, but it is still slow unless you reuse memory. Does PyMem_Malloc reuse previous small allocations?

For those that read all of this, the conclusion is that Theano shouldn't block this optimization. If you optimize the allocation of new PyArrayObjects, there will be less incentive to do the "wish list" optimization.

One last thing to keep in mind is that you should keep the data segment aligned. I would argue that alignment on the datatype size isn't enough, so I would suggest the cache line size or something like that, but I don't have numbers to base this on. This would also help in the case of a resize that changes the number of dimensions.

Fred
On Wed, Jul 17, 2013 at 10:57 AM, Frédéric Bastien <nouiz@nouiz.org> wrote:
On Wed, Jul 17, 2013 at 10:39 AM, Nathaniel Smith <njs@pobox.com> wrote:
Hi,
On Tue, Jul 16, 2013 at 11:55 AM, Nathaniel Smith <njs@pobox.com> wrote:
On Tue, Jul 16, 2013 at 2:34 PM, Arink Verma <arinkverma@gmail.com>
wrote:
Each ndarray does two mallocs, for the obj and buffer. These could be combined into 1 - just allocate the total size and do some pointer arithmetic, then set OWNDATA to false.

So those two mallocs have been mentioned in the project introduction. I got that wrong.
On further thought/reading the code, it appears to be more complicated than that, actually.
It looks like (for a non-scalar array) we have 2 calls to PyMem_Malloc: 1 for the array object itself, and one for the shapes + strides. And, one call to regular-old malloc: for the data buffer.
(Mysteriously, shapes + strides together have 2*ndim elements, but to hold them we allocate a memory region sized to hold 3*ndim elements. I'm not sure why.)
And contrary to what I said earlier, this is about as optimized as it can be without breaking ABI. We need at least 2 calls to malloc/PyMem_Malloc, because the shapes+strides may need to be resized without affecting the much larger data area. But it's tempting to allocate the array object and the data buffer in a single memory region, like I suggested earlier. And this would ALMOST work. But, it turns out there is code out there which assumes (whether wisely or not) that you can swap around which data buffer a given PyArrayObject refers to (hi Theano!). And supporting this means that data buffers and PyArrayObjects need to be in separate memory regions.

On Tue, Jul 16, 2013 at 7:53 PM, Frédéric Bastien <nouiz@nouiz.org> wrote:
Are you sure that Theano "swaps" the data ptr of an ndarray? When we play with that, it is on a newly created ndarray, so a node in our graph won't change the input ndarray structure. It will create a new ndarray structure with new shape/strides and pass a data ptr, and we flag the new ndarray with own_data correctly, to my knowledge.

If Theano poses a problem here, I'll suggest that I fix Theano. But currently I don't see the problem. So if this makes you change your mind about this optimization, tell me. I don't want Theano to prevent optimization in NumPy.
It's entirely possible I misunderstood, so let's see if we can work it out. I know that you want to assign to the ->data pointer in a PyArrayObject, right? That's what caused some trouble with the 1.7 API deprecations, which were trying to prevent direct access to this field? Creating a new array given a pointer to a memory region is no problem, and obviously will be supported regardless of any optimizations. But if that's all you were doing then you shouldn't have run into the deprecation problem. Or maybe I'm misremembering!
What is currently done in only one place is to create a new PyArrayObject with a given ptr, so NumPy doesn't do the allocation. We later change that ptr to another one.

It is the change to the ptr of the just-created PyArrayObject that caused problems with the interface deprecation. I fixed all the other problems related to the deprecation (mostly just renames of functions/macros), but I didn't fix this one yet. I would need to change the logic to compute the final ptr before creating the PyArrayObject and create it with the final data ptr. But in all cases, NumPy didn't allocate the data memory for this object, so this case doesn't block your optimization.

One thing on our optimization "wish list" is to reuse allocated PyArrayObjects between Theano function calls for intermediate results (so completely under Theano's control). This could be useful in particular for reshape/transpose/subtensor. Those functions are pretty fast, and from memory I already found the allocation time to be significant. But in those cases it is on PyArrayObjects that are views, so the metadata and the data would be in different memory regions in all cases.

The other case on the optimization "wish list" is reusing the PyArrayObject when the shape isn't the right one (but the number of dimensions is the same). If we do that for operations like addition, we will need to use PyArray_Resize(). This will be done on PyArrayObjects whose data memory was allocated by NumPy. So if you do one memory allocation for metadata and data, just make sure that PyArray_Resize() will handle that correctly.

On the usefulness of doing only one memory allocation: on our old gpu ndarray, we were doing two allocs on the GPU, one for metadata and one for data. I removed this, as it was a bottleneck. Allocation on the CPU is faster than on the GPU, but it is still slow unless you reuse memory. Does PyMem_Malloc reuse previous small allocations?

For those that read all of this, the conclusion is that Theano shouldn't block this optimization. If you optimize the allocation of new PyArrayObjects, there will be less incentive to do the "wish list" optimization.

One last thing to keep in mind is that you should keep the data segment aligned. I would argue that alignment on the datatype size isn't enough, so I would suggest the cache line size or something like that, but I don't have numbers to base this on. This would also help in the case of a resize that changes the number of dimensions.
There is a similar thing done in f2py, which is still keeping it from being current with the 1.7 macro replacement by functions. I'd like to add a 'swap' type function and would welcome discussion/implementation of such.

Chuck
On Wed, Jul 17, 2013 at 5:57 PM, Frédéric Bastien <nouiz@nouiz.org> wrote:
On Wed, Jul 17, 2013 at 10:39 AM, Nathaniel Smith <njs@pobox.com> wrote:
On Tue, Jul 16, 2013 at 11:55 AM, Nathaniel Smith <njs@pobox.com> wrote:
It's entirely possible I misunderstood, so let's see if we can work it out. I know that you want to assign to the ->data pointer in a PyArrayObject, right? That's what caused some trouble with the 1.7 API deprecations, which were trying to prevent direct access to this field? Creating a new array given a pointer to a memory region is no problem, and obviously will be supported regardless of any optimizations. But if that's all you were doing then you shouldn't have run into the deprecation problem. Or maybe I'm misremembering!
What is currently done in only one place is to create a new PyArrayObject with a given ptr, so NumPy doesn't do the allocation. We later change that ptr to another one.
Hmm, OK, so that would still work. If the array has the OWNDATA flag set (or you otherwise know where the data came from), then swapping the data pointer would still work.

The change would be that in most cases when asking numpy to allocate a new array from scratch, the OWNDATA flag would not be set. That's because the OWNDATA flag really means "when this object is deallocated, call free(self->data)", but if we allocate the array struct and the data buffer together in a single memory region, then deallocating the object will automatically cause the data buffer to be deallocated as well, without the array destructor having to take any special effort.
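In other words, the destructor rule being discussed is roughly this (a simplified stand-in struct, not the real array_dealloc):

    #include <stdlib.h>

    typedef struct {
        char *data;
        int owndata;   /* "call free(self->data) on deallocation" */
    } fake_array;

    static void fake_array_dealloc(fake_array *a)
    {
        if (a->owndata)
            free(a->data);  /* buffer was malloced separately */
        free(a);            /* with a combined allocation, this alone
                               releases the data too, so owndata stays 0 */
    }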
It is the change to the ptr of the just-created PyArrayObject that caused problems with the interface deprecation. I fixed all the other problems related to the deprecation (mostly just renames of functions/macros), but I didn't fix this one yet. I would need to change the logic to compute the final ptr before creating the PyArrayObject and create it with the final data ptr. But in all cases, NumPy didn't allocate the data memory for this object, so this case doesn't block your optimization.
Right.
One thing on our optimization "wish list" is to reuse allocated PyArrayObjects between Theano function calls for intermediate results (so completely under Theano's control). This could be useful in particular for reshape/transpose/subtensor. Those functions are pretty fast, and from memory I already found the allocation time to be significant. But in those cases it is on PyArrayObjects that are views, so the metadata and the data would be in different memory regions in all cases.

The other case on the optimization "wish list" is reusing the PyArrayObject when the shape isn't the right one (but the number of dimensions is the same). If we do that for operations like addition, we will need to use PyArray_Resize(). This will be done on PyArrayObjects whose data memory was allocated by NumPy. So if you do one memory allocation for metadata and data, just make sure that PyArray_Resize() will handle that correctly.
I'm not sure I follow the details here, but it does turn out that a really surprising amount of time in PyArray_NewFromDescr is spent in just calculating and writing out the shape and strides buffers, so for programs that e.g. use hundreds of small 3-element arrays to represent points in space, re-using even these buffers might be a big win...
On the usefulness of doing only one memory allocation: on our old gpu ndarray, we were doing two allocs on the GPU, one for metadata and one for data. I removed this, as it was a bottleneck. Allocation on the CPU is faster than on the GPU, but it is still slow unless you reuse memory. Does PyMem_Malloc reuse previous small allocations?
Yes, at least in theory PyMem_Malloc is highly-optimized for small buffer re-use. (For requests >256 bytes it just calls malloc().) And it's possible to define type-specific freelists; not sure if there's any value in doing that for PyArrayObjects. See Objects/obmalloc.c in the Python source tree. -n
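A type-specific freelist in the CPython style would look something like this rough sketch (illustrative only, not proposed NumPy code; assumes all recycled objects have the same fixed size, as a per-type freelist would):

    #include <stdlib.h>

    #define FREELIST_MAX 64

    typedef struct node { struct node *next; } node;

    static node *freelist = NULL;
    static int freelist_len = 0;

    static void *obj_alloc(size_t size)  /* size must be >= sizeof(node) */
    {
        if (freelist != NULL) {
            node *n = freelist;          /* reuse a recycled object */
            freelist = n->next;
            freelist_len--;
            return n;
        }
        return malloc(size);
    }

    static void obj_free(void *p)
    {
        if (freelist_len < FREELIST_MAX) {
            node *n = p;                 /* push back for later reuse */
            n->next = freelist;
            freelist = n;
            freelist_len++;
        }
        else {
            free(p);
        }
    }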
On 18.07.2013 15:36, Nathaniel Smith wrote:
On Wed, Jul 17, 2013 at 5:57 PM, Frédéric Bastien <nouiz@nouiz.org> wrote:
On Wed, Jul 17, 2013 at 10:39 AM, Nathaniel Smith <njs@pobox.com> wrote:
On Tue, Jul 16, 2013 at 11:55 AM, Nathaniel Smith <njs@pobox.com> wrote:
It's entirely possible I misunderstood, so let's see if we can work it out. I know that you want to assign to the ->data pointer in a PyArrayObject, right? That's what caused some trouble with the 1.7 API deprecations, which were trying to prevent direct access to this field? Creating a new array given a pointer to a memory region is no problem, and obviously will be supported regardless of any optimizations. But if that's all you were doing then you shouldn't have run into the deprecation problem. Or maybe I'm misremembering!
What is currently done in only one place is to create a new PyArrayObject with a given ptr, so NumPy doesn't do the allocation. We later change that ptr to another one.
Hmm, OK, so that would still work. If the array has the OWNDATA flag set (or you otherwise know where the data came from), then swapping the data pointer would still work.
The change would be that in most cases when asking numpy to allocate a new array from scratch, the OWNDATA flag would not be set. That's because the OWNDATA flag really means "when this object is deallocated, call free(self->data)", but if we allocate the array struct and the data buffer together in a single memory region, then deallocating the object will automatically cause the data buffer to be deallocated as well, without the array destructor having to take any special effort.
It is the change to the ptr of the just-created PyArrayObject that caused problems with the interface deprecation. I fixed all the other problems related to the deprecation (mostly just renames of functions/macros), but I didn't fix this one yet. I would need to change the logic to compute the final ptr before creating the PyArrayObject and create it with the final data ptr. But in all cases, NumPy didn't allocate the data memory for this object, so this case doesn't block your optimization.
Right.
One thing on our optimization "wish list" is to reuse allocated PyArrayObjects between Theano function calls for intermediate results (so completely under Theano's control). This could be useful in particular for reshape/transpose/subtensor. Those functions are pretty fast, and from memory I already found the allocation time to be significant. But in those cases it is on PyArrayObjects that are views, so the metadata and the data would be in different memory regions in all cases.

The other case on the optimization "wish list" is reusing the PyArrayObject when the shape isn't the right one (but the number of dimensions is the same). If we do that for operations like addition, we will need to use PyArray_Resize(). This will be done on PyArrayObjects whose data memory was allocated by NumPy. So if you do one memory allocation for metadata and data, just make sure that PyArray_Resize() will handle that correctly.
I'm not sure I follow the details here, but it does turn out that a really surprising amount of time in PyArray_NewFromDescr is spent in just calculating and writing out the shape and strides buffers, so for programs that e.g. use hundreds of small 3-element arrays to represent points in space, re-using even these buffers might be a big win...
On the usefulness of doing only one memory allocation: on our old gpu ndarray, we were doing two allocs on the GPU, one for metadata and one for data. I removed this, as it was a bottleneck. Allocation on the CPU is faster than on the GPU, but it is still slow unless you reuse memory. Does PyMem_Malloc reuse previous small allocations?
Yes, at least in theory PyMem_Malloc is highly-optimized for small buffer re-use. (For requests >256 bytes it just calls malloc().) And it's possible to define type-specific freelists; not sure if there's any value in doing that for PyArrayObjects. See Objects/obmalloc.c in the Python source tree.
-n
PyMem_Malloc is just a wrapper around malloc, so it's only as optimized as the C library is (glibc is not good for small allocations). PyObject_Malloc uses a small-object allocator for requests smaller than 512 bytes (256 in python2).

I filed a pull request [0] replacing a few functions which I think are safe to convert to this API: the nditer allocation, which is completely encapsulated, and the construction of the scalar and array python objects, which are deleted via the tp_free slot (we really should not support third party libraries using PyMem_Free on python objects without checks).

This already gives up to 15% improvements for scalar operations compared to glibc 2.17 malloc. Do I understand the discussions here right that we could replace PyDimMem_NEW, which is used for the strides in PyArray, with the small-object allocation too? It would still allow swapping the stride buffer, but every application must then delete it with PyDimMem_FREE, which should be a reasonable requirement.

[0] https://github.com/numpy/numpy/pull/4177
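The substitution being described is roughly this (alloc_dims/free_dims are invented helper names; NumPy proper uses npy_intp where this sketch uses Py_ssize_t):

    #include <Python.h>

    /* Route a small, fully encapsulated internal buffer through
     * Python's small-object allocator. The one hard rule is that
     * PyObject_Malloc must be paired with PyObject_Free. */
    static Py_ssize_t *alloc_dims(int ndim)
    {
        /* requests under the small-object threshold (512 bytes, 256
         * in python2) come from the arena allocator, not malloc */
        return PyObject_Malloc(ndim * sizeof(Py_ssize_t));
    }

    static void free_dims(Py_ssize_t *dims)
    {
        PyObject_Free(dims);
    }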
Hi,

As I said, I don't think Theano swaps the stride buffer. Most of the time, we allocate with PyArray_empty or zeros (not sure of the capitalization). The only exception I remember was changed in the last release to use PyArray_NewFromDescr(). Before that, we were allocating the PyArray with the right number of dimensions and then manually filling the ptr, shapes and strides. I don't recall any swapping of pointers for shapes and strides in Theano. So I don't see why Theano would prevent doing just one malloc for the struct and the shapes/strides. If it does, tell me and I'll fix Theano :) I don't want Theano to prevent optimization in NumPy. Theano now completely supports the new NumPy C-API interface.

Nathaniel also said that resizing the PyArray could prevent that. When Theano calls PyArray_Resize (not sure of the syntax), we always keep the number of dimensions the same, but I don't know if other code does differently. That could be a reason to keep separate allocs.

I don't know of any software that manually frees the strides/shapes pointer to swap it. So I also think your suggestion to change PyDimMem_NEW to call the small allocator is good. The new interface prevents people from doing that anyway, I think. Do we need to wait until we completely remove the old interface for this?

Fred

On Wed, Jan 8, 2014 at 1:13 PM, Julian Taylor <jtaylor.debian@googlemail.com> wrote:
On 18.07.2013 15:36, Nathaniel Smith wrote:
On Wed, Jul 17, 2013 at 5:57 PM, Frédéric Bastien <nouiz@nouiz.org> wrote:
On Wed, Jul 17, 2013 at 10:39 AM, Nathaniel Smith <njs@pobox.com> wrote:
On Tue, Jul 16, 2013 at 11:55 AM, Nathaniel Smith <njs@pobox.com> wrote:
It's entirely possible I misunderstood, so let's see if we can work it out. I know that you want to assign to the ->data pointer in a PyArrayObject, right? That's what caused some trouble with the 1.7 API deprecations, which were trying to prevent direct access to this field? Creating a new array given a pointer to a memory region is no problem, and obviously will be supported regardless of any optimizations. But if that's all you were doing then you shouldn't have run into the deprecation problem. Or maybe I'm misremembering!
What is currently done in only one place is to create a new PyArrayObject with a given ptr, so NumPy doesn't do the allocation. We later change that ptr to another one.
Hmm, OK, so that would still work. If the array has the OWNDATA flag set (or you otherwise know where the data came from), then swapping the data pointer would still work.
The change would be that in most cases when asking numpy to allocate a new array from scratch, the OWNDATA flag would not be set. That's because the OWNDATA flag really means "when this object is deallocated, call free(self->data)", but if we allocate the array struct and the data buffer together in a single memory region, then deallocating the object will automatically cause the data buffer to be deallocated as well, without the array destructor having to take any special effort.
It is the change to the ptr of the just-created PyArrayObject that caused problems with the interface deprecation. I fixed all the other problems related to the deprecation (mostly just renames of functions/macros), but I didn't fix this one yet. I would need to change the logic to compute the final ptr before creating the PyArrayObject and create it with the final data ptr. But in all cases, NumPy didn't allocate the data memory for this object, so this case doesn't block your optimization.
Right.
One thing on our optimization "wish list" is to reuse allocated PyArrayObjects between Theano function calls for intermediate results (so completely under Theano's control). This could be useful in particular for reshape/transpose/subtensor. Those functions are pretty fast, and from memory I already found the allocation time to be significant. But in those cases it is on PyArrayObjects that are views, so the metadata and the data would be in different memory regions in all cases.

The other case on the optimization "wish list" is reusing the PyArrayObject when the shape isn't the right one (but the number of dimensions is the same). If we do that for operations like addition, we will need to use PyArray_Resize(). This will be done on PyArrayObjects whose data memory was allocated by NumPy. So if you do one memory allocation for metadata and data, just make sure that PyArray_Resize() will handle that correctly.
I'm not sure I follow the details here, but it does turn out that a really surprising amount of time in PyArray_NewFromDescr is spent in just calculating and writing out the shape and strides buffers, so for programs that e.g. use hundreds of small 3-element arrays to represent points in space, re-using even these buffers might be a big win...
On the usefulness of doing only one memory allocation: on our old gpu ndarray, we were doing two allocs on the GPU, one for metadata and one for data. I removed this, as it was a bottleneck. Allocation on the CPU is faster than on the GPU, but it is still slow unless you reuse memory. Does PyMem_Malloc reuse previous small allocations?
Yes, at least in theory PyMem_Malloc is highly-optimized for small buffer re-use. (For requests >256 bytes it just calls malloc().) And it's possible to define type-specific freelists; not sure if there's any value in doing that for PyArrayObjects. See Objects/obmalloc.c in the Python source tree.
-n
PyMem_Malloc is just a wrapper around malloc, so it's only as optimized as the C library is (glibc is not good for small allocations). PyObject_Malloc uses a small-object allocator for requests smaller than 512 bytes (256 in python2).
I filed a pull request [0] replacing a few functions which I think are safe to convert to this API: the nditer allocation, which is completely encapsulated, and the construction of the scalar and array python objects, which are deleted via the tp_free slot (we really should not support third party libraries using PyMem_Free on python objects without checks).

This already gives up to 15% improvements for scalar operations compared to glibc 2.17 malloc. Do I understand the discussions here right that we could replace PyDimMem_NEW, which is used for the strides in PyArray, with the small-object allocation too? It would still allow swapping the stride buffer, but every application must then delete it with PyDimMem_FREE, which should be a reasonable requirement.
[0] https://github.com/numpy/numpy/pull/4177
On Wed, Jan 8, 2014 at 12:13 PM, Julian Taylor <jtaylor.debian@googlemail.com> wrote:
On 18.07.2013 15:36, Nathaniel Smith wrote:
On Wed, Jul 17, 2013 at 5:57 PM, Frédéric Bastien <nouiz@nouiz.org> wrote:
On the usefulness of doing only one memory allocation: on our old gpu ndarray, we were doing two allocs on the GPU, one for metadata and one for data. I removed this, as it was a bottleneck. Allocation on the CPU is faster than on the GPU, but it is still slow unless you reuse memory. Does PyMem_Malloc reuse previous small allocations?
Yes, at least in theory PyMem_Malloc is highly-optimized for small buffer re-use. (For requests >256 bytes it just calls malloc().) And it's possible to define type-specific freelists; not sure if there's any value in doing that for PyArrayObjects. See Objects/obmalloc.c in the Python source tree.
PyMem_Malloc is just a wrapper around malloc, so it's only as optimized as the C library is (glibc is not good for small allocations). PyObject_Malloc uses a small-object allocator for requests smaller than 512 bytes (256 in python2).
Right, I meant PyObject_Malloc of course.
I filed a pull request [0] replacing a few functions which I think are safe to convert to this API: the nditer allocation, which is completely encapsulated, and the construction of the scalar and array python objects, which are deleted via the tp_free slot (we really should not support third party libraries using PyMem_Free on python objects without checks).

This already gives up to 15% improvements for scalar operations compared to glibc 2.17 malloc. Do I understand the discussions here right that we could replace PyDimMem_NEW, which is used for the strides in PyArray, with the small-object allocation too? It would still allow swapping the stride buffer, but every application must then delete it with PyDimMem_FREE, which should be a reasonable requirement.
That sounds reasonable to me.

If we wanted to get even more elaborate, we could by default stick the shape/strides into the same allocation as the PyArrayObject, and then defer allocating a separate buffer until someone actually calls PyArray_Resize. (With a new flag, similar to OWNDATA, that tells us whether we need to free the shape/stride buffer when deallocating the array.) It's got to be a vanishingly small proportion of arrays where PyArray_Resize is actually called, so for most arrays, this would let us skip the allocation entirely, and the only cost would be that for arrays where PyArray_Resize *is* called to add new dimensions, we'd leave the original buffers sitting around until the array was freed, wasting a tiny amount of memory. Given that no-one has noticed that currently *every* array wastes 50% of this much memory (see upthread), I doubt anyone will care...

--
Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org
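The layout being proposed might look like this hypothetical sketch (invented struct and flag names, not NumPy's actual layout): dims and strides live inline after the header, and a flag analogous to OWNDATA records whether a later ndim-growing resize moved them to a separate buffer.

    #include <stdlib.h>

    #define OWNS_DIMS 0x1  /* hypothetical flag, analogous to OWNDATA */

    typedef struct {
        int ndim;
        int flags;
        long *dims;     /* point either at inline_dims or a malloced buffer */
        long *strides;
        long inline_dims[];  /* room for 2*ndim entries (C99 flexible member) */
    } small_array;

    static small_array *small_array_new(int ndim)
    {
        small_array *a = malloc(sizeof(*a) + 2 * ndim * sizeof(long));
        if (a == NULL)
            return NULL;
        a->ndim = ndim;
        a->flags = 0;                   /* OWNS_DIMS unset: inline storage */
        a->dims = a->inline_dims;
        a->strides = a->inline_dims + ndim;
        return a;
    }

    static void small_array_free(small_array *a)
    {
        if (a->flags & OWNS_DIMS)
            free(a->dims);              /* only after an ndim-growing resize */
        free(a);
    }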
On Wed, Jan 8, 2014 at 3:40 PM, Nathaniel Smith <njs@pobox.com> wrote:
On Wed, Jan 8, 2014 at 12:13 PM, Julian Taylor <jtaylor.debian@googlemail.com> wrote:
On 18.07.2013 15:36, Nathaniel Smith wrote:
On Wed, Jul 17, 2013 at 5:57 PM, Frédéric Bastien <nouiz@nouiz.org> wrote:
On the usefulness of doing only one memory allocation: on our old gpu ndarray, we were doing two allocs on the GPU, one for metadata and one for data. I removed this, as it was a bottleneck. Allocation on the CPU is faster than on the GPU, but it is still slow unless you reuse memory. Does PyMem_Malloc reuse previous small allocations?
Yes, at least in theory PyMem_Malloc is highly-optimized for small buffer re-use. (For requests >256 bytes it just calls malloc().) And it's possible to define type-specific freelists; not sure if there's any value in doing that for PyArrayObjects. See Objects/obmalloc.c in the Python source tree.
PyMem_Malloc is just a wrapper around malloc, so it's only as optimized as the C library is (glibc is not good for small allocations). PyObject_Malloc uses a small-object allocator for requests smaller than 512 bytes (256 in python2).
Right, I meant PyObject_Malloc of course.
I filed a pull request [0] replacing a few functions which I think are safe to convert to this API: the nditer allocation, which is completely encapsulated, and the construction of the scalar and array python objects, which are deleted via the tp_free slot (we really should not support third party libraries using PyMem_Free on python objects without checks).

This already gives up to 15% improvements for scalar operations compared to glibc 2.17 malloc. Do I understand the discussions here right that we could replace PyDimMem_NEW, which is used for the strides in PyArray, with the small-object allocation too? It would still allow swapping the stride buffer, but every application must then delete it with PyDimMem_FREE, which should be a reasonable requirement.
That sounds reasonable to me.
If we wanted to get even more elaborate, we could by default stick the shape/strides into the same allocation as the PyArrayObject, and then defer allocating a separate buffer until someone actually calls PyArray_Resize. (With a new flag, similar to OWNDATA, that tells us whether we need to free the shape/stride buffer when deallocating the array.) It's got to be a vanishingly small proportion of arrays where PyArray_Resize is actually called, so for most arrays, this would let us skip the allocation entirely, and the only cost would be that for arrays where PyArray_Resize *is* called to add new dimensions, we'd leave the original buffers sitting around until the array was freed, wasting a tiny amount of memory. Given that no-one has noticed that currently *every* array wastes 50% of this much memory (see upthread), I doubt anyone will care...
Seems like a good plan. When is it planned to remove the old interface? We can't do this before then, I think.

Fred
participants (5)
- Arink Verma
- Charles R Harris
- Frédéric Bastien
- Julian Taylor
- Nathaniel Smith