numpy arrays, data allocation and SIMD alignment
Hi,

Following an ongoing discussion with S. Johnson, one of the developers of fftw3, I would be interested in what people think about adding infrastructure in numpy related to SIMD alignment (that is, 16 byte alignment for SSE/ALTIVEC; I don't know anything about other archs). The problem is that right now, it is difficult to get information about alignment in numpy (by alignment here, I mean something different from what is normally meant in a numpy context: whereas, in my understanding, NPY_ALIGNED refers to a pointer which is aligned wrt its type, here I am talking about arbitrary alignment). For example, for fftw3, we need to know whether a given data buffer is 16 byte aligned to get optimal performance; generally, SSE needs 16 byte alignment for optimal performance, as does altivec.

I think it would be nice to get some infrastructure to help developers get that kind of information, and maybe to be able to request 16 byte aligned buffers. Here is what I can think of:

- adding an API to know whether a given PyArrayObject has its data buffer 16 bytes aligned, and to request a 16 bytes aligned PyArrayObject. Something like NPY_ALIGNED, basically.
- forcing data allocation to be 16 bytes aligned in numpy (eg define PyDataMem_Mem to use a 16 bytes aligned allocator instead of malloc). This would mean that many arrays would be "naturally" 16 bytes aligned without effort.

Point 2 is really easy to implement, I think: actually, on some platforms (Mac OS X and FreeBSD), malloc returns 16 bytes aligned buffers anyway, so I don't think the wasted space is a real problem. Linux with glibc is 8 bytes aligned; I don't know about windows. Implementing our own 16 bytes aligned memory allocator for cross platform compatibility should be relatively easy. I don't see any drawback, but I guess other people will.

Point 1 is more tricky, as this requires many more changes in the code.

Do the main developers of numpy have an opinion on this?

cheers,
David
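The check in the first point can already be approximated from Python today via the array's data pointer; a minimal sketch (the function name `is_simd_aligned` is made up here, not an existing numpy API):

```python
import numpy as np

def is_simd_aligned(a, width=16):
    """Return True if the array's data buffer starts on a `width`-byte
    boundary. Hypothetical helper, not part of numpy."""
    return a.ctypes.data % width == 0

a = np.zeros(1024)
print(is_simd_aligned(a, 16))
```

Whether the result is True depends on what the platform's malloc happens to return, which is exactly the problem the proposal is about.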
[...]
Excellent idea if practical... Matthew
Dear David,

Both ideas, particularly the 2nd, would be excellent additions to numpy. I often use the Intel IPP (Integrated Performance Primitives) library together with numpy, but I have to do all my memory allocation with the IPP to ensure fastest operation. I then create numpy views of the data. All this works brilliantly, but it would be really nice if I could allocate the memory directly in numpy.

IPP allocates, and says it wants, 32 byte aligned memory (see, e.g. http://www.intel.com/support/performancetools/sb/CS-021418.htm ). Given that fftw3 apparently wants 16 byte aligned memory, my feeling is that, if the effort is made, the alignment width should be specified at run-time, rather than hard-coded.

In terms of implementation of your 1st point, I'm not aware of how much effort your idea would take (and it does sound nice), but some benefit would be had just from a simple function numpy.is_mem_aligned(ndarray, width=16) which returns a bool.

Cheers! Andrew

David Cournapeau wrote:
[...]
_______________________________________________
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion
Andrew Straw wrote:
[...] Given that fftw3 apparently wants 16 byte aligned memory, my feeling is that, if the effort is made, the alignment width should be specified at run-time, rather than hard-coded.

I think that doing it at runtime would be overkill, no? I was thinking about making it a compile option. Generally, at the ASM level, you need 16 byte alignment (for instructions like movaps, which takes 16 bytes from memory and puts them into an SSE register); this is not just fftw. Maybe the 32 byte alignment is useful for cache reasons, I don't know. I don't think it would be difficult to implement and validate; what I don't know at all is the implication of this at the binary level, if any. cheers, David
On 8/3/07, David Cournapeau <david@ar.media.kyoto-u.ac.jp> wrote:
[...]
Here's a hack that google turned up:

(1) Use static variables instead of dynamic (stack) variables
(2) Use in-line assembly code that explicitly aligns data
(3) In C code, use "malloc" to explicitly allocate variables

Here is Intel's example of (2):

    ; procedure prologue
    push ebp
    mov esp, ebp
    and ebp, -8
    sub esp, 12

    ; procedure epilogue
    add esp, 12
    pop ebp
    ret

Intel's example of (3), slightly modified (the original did the rounding arithmetic directly on the double pointer; it needs to be done on the byte address):

    double *p, *newp;
    p = (double*) malloc((sizeof(double) * NPTS) + 4);
    newp = (double*) (((uintptr_t) p + 4) & ~(uintptr_t) 7);

This assures that newp is 8-byte aligned even if p is not. However, malloc() may already follow Intel's recommendation that 32-byte or greater data structures be aligned on a 32-byte boundary. In that case, increasing the requested memory by 4 bytes and computing newp are superfluous. Chuck
On 8/3/07, Charles R Harris <charlesr.harris@gmail.com> wrote:
[...]
I think that for numpy arrays it should be possible to define the offset so that the result is 32 byte aligned. However, this might break some people's code if they haven't paid attention to the offset. Another possibility is to allocate an oversized array, check the pointer, and take a range out of it. For instance:

In [32]: a = zeros(10)

In [33]: a.ctypes.data % 32
Out[33]: 16

The array alignment is 16 bytes, consequently

In [34]: a[2:].ctypes.data % 32
Out[34]: 0

Voila, 32 byte alignment. I think a short python routine could do this, which ought to serve well for 1D fft's. Multidimensional arrays will be trickier if you want the rows to be aligned. Aligning the columns just isn't going to work. Chuck
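The short python routine mentioned above might look like the following sketch (the name `aligned_zeros` is made up here; it over-allocates and slices, exactly as in the IPython session):

```python
import numpy as np

def aligned_zeros(n, alignment=32, dtype=np.float64):
    """Allocate an oversized 1-D array and slice off the front so the
    returned view starts on an `alignment`-byte boundary. Sketch of the
    over-allocate-and-slice idea; not a numpy API."""
    itemsize = np.dtype(dtype).itemsize
    assert alignment % itemsize == 0
    extra = alignment // itemsize          # spare items to shift into alignment
    buf = np.zeros(n + extra, dtype=dtype)
    byte_off = -buf.ctypes.data % alignment
    # holds as long as numpy's buffer is at least itemsize-aligned
    assert byte_off % itemsize == 0
    offset = byte_off // itemsize
    return buf[offset:offset + n]

a = aligned_zeros(10, alignment=32)
print(a.ctypes.data % 32)  # → 0
```

The returned array keeps the oversized buffer alive through its `base` attribute, so the memory is freed only when the view goes away.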
[...]
I think that for numpy arrays it should be possible to define the offset so that the result is 32 byte aligned. However, this might break some people's code if they haven't paid attention to the offset.
Why? I really don't see how it can break anything at the source code level. You don't have to care about things you didn't care about before: the best proof of that is that numpy runs on different platforms where the malloc has different alignment guarantees (mac OS X already aligns to 16 bytes, for the very reason of making optimizing with SIMD easier, whereas glibc malloc only aligns to 8 bytes, at least on Linux).
Another possibility is to allocate an oversized array, check the pointer, and take a range out of it. For instance:
In [32]: a = zeros(10)
In [33]: a.ctypes.data % 32 Out[33]: 16
The array alignment is 16 bytes, consequently
In [34]: a[2:].ctypes.data % 32 Out[34]: 0
Voila, 32 byte alignment. I think a short python routine could do this, which ought to serve well for 1D fft's. Multidimensional arrays will be trickier if you want the rows to be aligned. Aligning the columns just isn't going to work.

I am not suggesting realigning existing arrays. What I would like numpy to support are the following cases:
- Check whether a given numpy array is simd aligned:

    /* Simple case: if aligned, use optimized func, use non optimized otherwise */
    int simd_func(double* in, size_t n);
    int nosimd_func(double* in, size_t n);

    if (PyArray_ISALIGNED_SIMD(a)) {
        simd_func((double *)a->data, a->size);
    } else {
        nosimd_func((double *)a->data, a->size);
    }

- Request explicitly an aligned array from any PyArray_* function which creates a ndarray, eg:

    ar = PyArray_FROM_OF(a, NPY_SIMD_ALIGNED);

Allocating a buffer aligned to a given alignment is not the problem: there is a posix function to do it, and we can easily implement a function for the OSes which do not support it. This would be done in C, not in python. cheers, David
On 04/08/07, David Cournapeau <david@ar.media.kyoto-u.ac.jp> wrote:
Here's a hack that google turned up:
I'd avoid hacks in favour of posix_memalign (which allows arbitrary degrees of alignment). For one thing, freeing becomes a headache (you can't free a pointer you've jiggered!).
[...]
I'd just like to point out that PyArray_ISALIGNED_SIMD(a) can be a macro which expands to something like "!(((uintptr_t)(a->data)) & 0xf)"; this avoids any change to the array objects and allows checking for arbitrary degrees of alignment - somebody mentioned the Intel Performance Primitives need 32-byte aligned data? One might also want page-aligned data or data aligned in some way with cache lines.

It seems to me two things are needed:

* A mechanism for requesting numpy arrays with buffers aligned to an arbitrary power-of-two size (basically just using posix_memalign or some horrible hack on platforms that don't have it).
* A macro (in C, and some way to get the same information from python, perhaps just "a.ctypes.data % 16") to test for common alignment cases; SIMD alignment and arbitrary power-of-two alignment are probably sufficient.

Does this fail to cover any important cases? Anne
On Aug 4, 3:24 am, "Anne Archibald" <peridot.face...@gmail.com> wrote:
It seems to me two things are needed:
* A mechanism for requesting numpy arrays with buffers aligned to an arbitrary power-of-two size (basically just using posix_memalign or some horrible hack on platforms that don't have it).
Right, you might as well allow the alignment (to a power-of-two size) to be specified at runtime, as there is really no cost to implementing an arbitrary alignment once you have any alignment. Although you should definitely use posix_memalign (or the old memalign) where it is available, unfortunately it's not implemented on all systems, e.g. MacOS X and FreeBSD don't have it, last I checked (although in both cases their malloc is 16-byte aligned). Microsoft VC++ has a function called _aligned_malloc which is equivalent.

However, since MinGW (www.mingw.org) didn't have an _aligned_malloc function, I wrote one for them a few years ago and put it in the public domain (I use MinGW to cross-compile to Windows from Linux and need the alignment). You are free to use it as a fallback on systems that don't have a memalign function if you want. It should work on any system where sizeof(void*) is a power of two (i.e. every extant architecture, that I know of). You can download it and its test program from:

ab-initio.mit.edu/~stevenj/align.c
ab-initio.mit.edu/~stevenj/tstalign.c

It just uses malloc with a little extra padding as needed to align the data, plus a copy of the original pointer so that you can still free and realloc (using _aligned_free and _aligned_realloc). It could be made a bit more efficient, but it probably doesn't matter.
* A macro (in C, and some way to get the same information from python, perhaps just "a.ctypes.data % 16") to test for common alignment cases; SIMD alignment and arbitrary power-of-two alignment are probably sufficient.
In C this is easy, just ((uintptr_t) pointer) % 16 == 0. You might also consider a way to set the default alignment of numpy arrays at runtime, rather than requesting aligned arrays individually. e.g. so that someone could come along at a later date to a large program and just add one function call to make all the arrays 16-byte aligned to improve performance using SIMD libraries. Regards, Steven G. Johnson
On 8/3/07, David Cournapeau <david@ar.media.kyoto-u.ac.jp> wrote:
Here is what I can think of: - adding an API to know whether a given PyArrayObject has its data buffer 16 bytes aligned, and requesting a 16 bytes aligned PyArrayObject. Something like NPY_ALIGNED, basically. - forcing data allocation to be 16 bytes aligned in numpy (eg define PyDataMem_Mem to a 16 bytes aligned allocator instead of malloc).
All this sounds pretty similar to std::allocator as found in the C++ STL (http://www.sgi.com/tech/stl/Allocators.html). Perhaps a NumPy array could be associated with an instance of an 'allocator' object (surely written in C, perhaps subclassable in Python) providing appropriate methods for alloc/dealloc(/realloc?/initialize(memset)?/copy(memcpy)?) of memory. This would be really nice, as it is extensible (you could even write a custom allocator, perhaps making use of a preallocated, static pool; use of C++ new/delete; use of any C++ std::allocator, shared memory, etc. etc.). I think this is the direction to go, but I have no idea how difficult it could be to implement.

--
Lisandro Dalcín
Centro Internacional de Métodos Computacionales en Ingeniería (CIMEC)
Instituto de Desarrollo Tecnológico para la Industria Química (INTEC)
Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET)
PTLC - Güemes 3450, (3000) Santa Fe, Argentina
Tel/Fax: +54-(0)342-451.1594
Lisandro Dalcin wrote:
[...]
Well, when I proposed the SIMD extension, I was willing to implement the proposal, and this was for a simple goal: enabling better integration with many numeric libraries which need SIMD alignment. As nice as a custom allocator might be, I will certainly not implement it myself. For SIMD, I think the complexity/benefit tradeoff is worth it (since there is not much change to the API and implementation), and I know more or less how to do it; a custom allocator is an entirely different story. That's really more complex; static pools may be useful in some cases (but that's not obvious, since only the data are allocated with this buffer, everything else being allocated through the python memory allocator, and numpy arrays have pretty simple memory allocation patterns). David
On 06/08/07, David Cournapeau <david@ar.media.kyoto-u.ac.jp> wrote:
[...]
I have to agree. I can hardly volunteer David for anything, and I don't have time to implement this myself, but I think a custom allocator is a rather special-purpose tool; if one were to implement one, I think the way to go would be to implement a subclass of ndarray (or just a constructor) that allocated the memory. This could be done from python, since you can make an ndarray from scratch using a given memory array. Of course, making temporaries be allocated with the correct allocator will be very complicated, since it's unclear which allocator should be used. Adding SIMD alignment should be a very small modification; it can be done as simply as using ctypes to wrap posix_memalign (or a portable version, possibly written in python) and writing a simple python function that checks the beginning data address. There's really no need to make it complicated. Anne
Anne Archibald wrote:
[...]
Anne, you said previously that it was easy to allocate buffers with a given alignment at runtime. Could you point me to a document which explains how? For platforms without posix_memalign, I don't see how to implement a memory allocator with an arbitrary alignment (more precisely, I don't see how to free it if I cannot assume a fixed alignment: how do I know where the "real" pointer is?). David
On 07/08/07, David Cournapeau <david@ar.media.kyoto-u.ac.jp> wrote:
Anne, you said previously that it was easy to allocate buffers for a given alignment at runtime. Could you point me to a document which explains how ? For platforms without posix_memalign, I don't see how to implement a memory allocator with an arbitrary alignment (more precisely, I don't see how to free it if I cannot assume a fixed alignement: how do I know where the "real" pointer is ?).
Well, it can be done in Python: just allocate a too-big ndarray and take a slice that's the right shape and has the right alignment. But this sucks. Steven G. Johnson posted code earlier in this thread that provides a portable aligned-memory allocator - it handles the freeing by (always) storing enough information to recover the original pointer in the padding space. (This means you always need to pad, which is a pain, but there's not much you can do about that.) His implementation stores the original pointer just before the beginning of the aligned data, so _aligned_free is free(((void**)ptr)[-1]). If you were worried about space (rather than time) you could store a single byte just before the pointer whose value indicated how much padding was done, or whatever. These schemes all waste space, but unless malloc's internal structures are the size of the alignment block, it's almost unavoidable to waste some space; the only way around it I can see is if the program also allocates lots of small, odd-shaped, unaligned blocks of memory that can be used to fill the gaps (and even then I doubt any sensible malloc implementation fills in little gaps like this, since it seems likely to lead to memory fragmentation). A posix_memalign that is built into malloc can do better than any implementation that isn't, though, with the possible exception of a specialized pool allocator built with aligned allocation in mind. Anne
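The pointer arithmetic behind that padded-malloc scheme can be sketched in plain Python integer arithmetic (addresses are just integers here; `PTR_SIZE` and `aligned_offset` are illustrative names, not anything from align.c):

```python
PTR_SIZE = 8  # assume 8-byte pointers

def aligned_offset(raw_addr, alignment):
    """Offset into the raw block where the aligned region starts, leaving
    at least PTR_SIZE bytes below it to stash the original pointer.
    `alignment` must be a power of two."""
    start = raw_addr + PTR_SIZE                       # room for the saved pointer
    aligned = (start + alignment - 1) & ~(alignment - 1)  # round up
    return aligned - raw_addr

# at most alignment + PTR_SIZE - 1 extra bytes are ever needed
for addr in range(1000, 1064):
    off = aligned_offset(addr, 16)
    assert (addr + off) % 16 == 0
    assert PTR_SIZE <= off < 16 + PTR_SIZE
print("scheme holds for all tested addresses")
```

The aligned-free then reads the saved pointer from just below the returned address, which is exactly the free(((void**)ptr)[-1]) trick described above.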
Anne Archibald wrote:
Well, it can be done in Python: just allocate a too-big ndarray and take a slice that's the right shape and has the right alignment. But this sucks. Steven G. Johnson posted code earlier in this thread that provides a portable aligned-memory allocator - it handles the freeing by (always) storing enough information to recover the original pointer in the padding space. (This means you always need to pad, which is a pain, but there's not much you can do about that.)

This is indeed no rocket science, I feel a bit ashamed :) I don't see the problem with padding (except wasted time)?

His implementation stores the original pointer just before the beginning of the aligned data, so _aligned_free is free(((void**)ptr)[-1]). If you were worried about space (rather than time) you could store a single byte just before the pointer whose value indicated how much padding was done, or whatever.

I really don't see how space would be a problem in our situation: it is not like we will pad more than a few bytes; in the case it is, I don't see how python would be the right choice anymore anyway. I will try to prepare a patch in the next few days, then. cheers, David
On Tue, Aug 07, 2007 at 01:33:24AM -0400, Anne Archibald wrote:
Well, it can be done in Python: just allocate a too-big ndarray and take a slice that's the right shape and has the right alignment. But this sucks.
Could you explain to me why this is such a bad idea? Stéfan
On 08/08/2007, Stefan van der Walt <stefan@sun.ac.za> wrote:
[...]
Could you explain to me why is this such a bad idea?
Oh. Well, it's not *terrible*; it gets you an aligned array. But you have to allocate the original array as a 1D byte array (to allow for arbitrary realignments) and then align it, reshape it, and reinterpret it as a new type. Plus you're allocating an extra ndarray structure, which will live as long as the new array does; this not only wastes even more memory than the portable alignment solutions, it clogs up python's garbage collector. It's not outrageous, if you need aligned arrays *now*, on a released version of numpy, but numpy itself should do better. Anne
On 8/8/07, Anne Archibald <peridot.faceted@gmail.com> wrote:
[...]
Oh. Well, it's not *terrible*; it gets you an aligned array. But you have to allocate the original array as a 1D byte array (to allow for arbitrary realignments) and then align it, reshape it, and reinterpret it as a new type. Plus you're allocating an extra ndarray structure, which will live as long as the new array does; this not only wastes even more memory than the portable alignment solutions, it clogs up python's garbage collector.
The ndarray structure doesn't take up much memory, it is the data that is large and the data is shared between the original array and the slice. Nor does the data type of the slice need changing, one simply uses the desired type to begin with, or at least a type of the right size so that a view will do the job without copies. Nor do I see how the garbage collector will get clogged up, slices are a common feature of using numpy. The slice method also has the advantage of being compiler and operating system independent, there is a reason Intel used that approach. Aligning multidimensional arrays might indeed be complicated, but I suspect those complications will be easier to handle in Python than in C. Chuck
On 08/08/2007, Charles R Harris <charlesr.harris@gmail.com> wrote:
[...]
The ndarray structure doesn't take up much memory, it is the data that is large and the data is shared between the original array and the slice. Nor does the data type of the slice need changing, one simply uses the desired type to begin with, or at least a type of the right size so that a view will do the job without copies. Nor do I see how the garbage collector will get clogged up, slices are a common feature of using numpy. The slice method also has the advantage of being compiler and operating system independent, there is a reason Intel used that approach.
Aligning multidimensional arrays might indeed be complicated, but I suspect those complications will be easier to handle in Python than in C.
Can we assume that numpy arrays allocated to contain (say) complex64s are aligned to a 16-byte boundary? I don't think they will necessarily, so the shift we need may not be an integer number of complex64s. float96s pose even more problems. So to ensure alignment, we do need to do type conversion; if we're doing it anyway, byte arrays require the least trust in malloc().

The ndarray object isn't too big, probably some twenty or thirty bytes, so I'm not talking about a huge waste. But it is a python object, and the garbage collector needs to walk the whole tree of accessible python objects every time it runs, so this is one more object on the list.

As an aside: numpy's handling of ndarray objects is actually not ideal; if you want to exhaust memory on your system, do:

a = arange(5)
while True:
    a = a[::-1]

Each ndarray object keeps alive the ndarray object it is a slice of, so this operation creates an ever-growing linked list of ndarray objects. Seems to me it would be better to keep a pointer only to the original object that holds the address of the buffer (so it can be freed).

Aligning multidimensional arrays is an interesting question. To first order, aligning the first element should be enough. If the dimensions of the array are not divisible by the alignment, though, this means that lower-dimensional complete slices may not be aligned:

A = aligned_empty((7,5), dtype=float, alignment=16)

Then A is aligned, as is A[0,:], but A[1,:] is not. So in this case we might want to actually allocate an 8-by-5 array and take a slice. This does mean it won't be contiguous in memory, so that flattening it requires a copy (which may not wind up aligned). This is something we might want to do - that is, make available as an option - in python.

Anne
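[Anne's padded-allocation idea can be sketched as follows. `aligned_empty_2d` is a hypothetical name; it pads the row length so every row starts on an aligned address, assuming `alignment` is a multiple of the itemsize:]

```python
import numpy as np

def aligned_empty_2d(shape, dtype=np.float64, alignment=16):
    # Hypothetical helper: pad each row so that all rows, not just the
    # first, start on an `alignment`-byte boundary.
    rows, cols = shape
    itemsize = np.dtype(dtype).itemsize
    # round the row size in bytes up to the next multiple of `alignment`
    padded_cols = (-(-cols * itemsize // alignment) * alignment) // itemsize
    nbytes = rows * padded_cols * itemsize
    # over-allocate a flat byte buffer so the start can be shifted onto a boundary
    raw = np.empty(nbytes + alignment, dtype=np.uint8)
    offset = (-raw.ctypes.data) % alignment
    buf = raw[offset:offset + nbytes].view(dtype).reshape(rows, padded_cols)
    return buf[:, :cols]

A = aligned_empty_2d((7, 5))
```

As Anne notes, the price is that `A` is no longer C-contiguous, so flattening it forces a copy.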
Anne, On 8/8/07, Anne Archibald <peridot.faceted@gmail.com> wrote:
On 08/08/2007, Charles R Harris <charlesr.harris@gmail.com> wrote:
On 8/8/07, Anne Archibald <peridot.faceted@gmail.com> wrote:
Oh. Well, it's not *terrible*; it gets you an aligned array. But you have to allocate the original array as a 1D byte array (to allow for arbitrary realignments) and then align it, reshape it, and reinterpret it as a new type. Plus you're allocating an extra ndarray structure, which will live as long as the new array does; this not only wastes even more memory than the portable alignment solutions, it clogs up python's garbage collector.
The ndarray structure doesn't take up much memory, it is the data that is large and the data is shared between the original array and the slice. Nor does the data type of the slice need changing, one simply uses the desired type to begin with, or at least a type of the right size so that a view will do the job without copies. Nor do I see how the garbage collector will get clogged up, slices are a common feature of using numpy. The slice method also has the advantage of being compiler and operating system independent, there is a reason Intel used that approach.
Aligning multidimensional arrays might indeed be complicated, but I suspect those complications will be easier to handle in Python than in C.
Can we assume that numpy arrays allocated to contain (say) complex64s are aligned to a 16-byte boundary? I don't think they will necessarily, so the shift we need may not be an integer number of complex64s. float96s pose even more problems. So to ensure alignment, we do need to do type conversion; if we're doing it anyway, byte arrays require the least trust in malloc().
I think that is a safe assumption, it is probably almost as safe as assuming binary and two's complement, likely more safe than assuming IEEE 754. I expect almost all 32 bit OS's to align on 4 byte boundaries at worst, 64 bit machines to align on 8 byte boundaries. Even C structures are typically filled out with blanks to preserve some sort of alignment. That is because of addressing efficiency, or even the impossibility of odd addressing -- depends on the architecture. Sometimes even byte addressing is easier to get by putting a larger integer on the bus and extracting the relevant part. In addition, I expect the heap implementation to make some alignment decisions for efficiency. My 64 bit linux on Intel aligns arrays, whatever the data type, on 16 byte boundaries. It might be interesting to see what happens with the Intel and MSVC compilers, but I expect similar results. PPC's, Sun and SGI need to be checked, but I don't expect problems. I think that will cover almost all architectures numpy is likely to run on.
The ndarray object isn't too big, probably some twenty or thirty bytes, so I'm not talking about a huge waste. But it is a python object, and the garbage collector needs to walk the whole tree of accessible python objects every time it runs, so this is one more object on the list.
As an aside: numpy's handling of ndarray objects is actually not ideal; if you want to exhaust memory on your system, do:
a = arange(5)
while True:
    a = a[::-1]
Well, that's a pathological case present in numpy. Fixing it doesn't seem to be a high priority, although there is a ticket somewhere.

Each ndarray object keeps alive the ndarray object it is a slice of, so this operation creates an ever-growing linked list of ndarray objects. Seems to me it would be better to keep a pointer only to the original object that holds the address of the buffer (so it can be freed).
Aligning multidimensional arrays is an interesting question. To first order, aligning the first element should be enough. If the dimensions of the array are not divisible by the alignment, though, this means that lower-dimensional complete slices may not be aligned:
A = aligned_empty((7,5),dtype=float,alignment=16)
Then A is aligned, as is A[0,:], but A[1,:] is not.
So in this case we might want to actually allocate an 8-by-5 array and take a slice. This does mean it won't be contiguous in memory, so that flattening it requires a copy (which may not wind up aligned). This is something we might want to do - that is, make available as an option - in python.
I think that is better viewed as need based. I suspect that if you really need such alignment it is better to start with array dimensions that will naturally align the rows. It will be impossible to naturally align all the columns unless the data type is the correct size. Chuck
My 64 bit linux on Intel aligns arrays, whatever the data type, on 16 byte boundaries. It might be interesting to see what happens with the Intel and MSVC compilers, but I expect similar results.
According to the doc on the msdn, the data should be 16-bits aligned. Matthieu
On 8/8/07, Matthieu Brucher <matthieu.brucher@gmail.com> wrote:
My 64 bit linux on Intel aligns arrays, whatever the data type, on 16 byte boundaries. It might be interesting to see what happens with the Intel and MSVC compilers, but I expect similar results.
According to the doc on the msdn, the data should be 16-bits aligned.
Shades of DOS and 16 bit machines. Have you checked what actually happens on modern hardware? Chuck
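[Chuck's question — what actually happens on modern hardware — is easy to probe empirically from Python. This is a quick experiment on the running platform, not a guarantee:]

```python
import numpy as np
from collections import Counter

# Tally the data-pointer offsets modulo 16 for many freshly allocated
# arrays; the distribution shows what alignment this platform's
# allocator actually delivers for numpy data buffers.
offsets = Counter(np.empty(1000, dtype=np.float64).ctypes.data % 16
                  for _ in range(200))
print(dict(offsets))
```

On a 64-bit glibc system one would expect to see only offset 0; an 8-byte-aligned allocator would show a mix of 0 and 8.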
Charles R Harris wrote:
Anne,
On 8/8/07, *Anne Archibald* <peridot.faceted@gmail.com> wrote:
I am not sure I understand which approach to which problem you are talking about here. IMHO, the discussion is getting a bit carried away. What I was suggesting is:
- being able to check whether a given data buffer is aligned to a given alignment (easy)
- being able to request an aligned data buffer: this requires aligned memory allocators, and some additions to the API for creating arrays.

This all boils down to the following case: I have a C function which requires N-byte-aligned data, and I want the numpy API to provide this capability. I don't understand the discussion about doing it in Python: first, it means you cannot request a data buffer at the C level. I don't understand the whole discussion on slices, multiple dimensions and so on either: at the C level, different libraries may need different array formats, and in the case of fftw, all it cares about is the alignment of the data pointer. For contiguous, C-order arrays, as long as the data pointer is aligned, I don't think we need more; are some people familiar with the MKL who could tell whether we need more?

cheers,

David
On 8/8/07, David Cournapeau <david@ar.media.kyoto-u.ac.jp> wrote:
Charles R Harris wrote:
Anne,
On 8/8/07, *Anne Archibald* <peridot.faceted@gmail.com> wrote:
I am not sure to understand which approach to which problem you are talking about here ?
IMHO, the discussion is becoming a bit carried away. What I was suggesting is - being able to check whether a given data buffer is aligned to a given alignment (easy) - being able to request an aligned data buffer: requires aligned memory allocators, and some additions to the API for creating arrays.
This all boils down to the following case: I have a C function which requires N bytes aligned data, I want the numpy API to provide this capability. I don't understand the discussion on doing it in python:
Well, what you want might be very easy to do in python, we just need to check the default alignments for doubles and floats for some of the other compilers, architectures, and OS's out there. On the other hand, you might not be able to request a c malloc that is aligned in a portable way without resorting to the same tricks as you do in python. So why not use python and get the reference counting and garbage collection along with it? What we want are doubles 8 byte aligned and floats 4 byte aligned. That seems to be the case with gcc, linux, and the Intel architecture. The idea is to create a slightly oversize array, then use a slice of the proper size that is 16 byte aligned. Chuck
Charles R Harris wrote:
Well, what you want might be very easy to do in python, we just need to check the default alignments for doubles and floats for some of the other compilers, architectures, and OS's out there. On the other hand, you might not be able to request a c malloc that is aligned in a portable way without resorting to the same tricks as you do in python. So why not use python and get the reference counting and garbage collection along with it?
First, doing it in python means that I cannot use the facility from C easily. But that is exactly where I need it, and where I would guess most people need it. People who want to interface numpy with the MKL will do it in C, right? And maybe I am just too dumb to see the problem, but I don't see the need for garbage collection and so on :) Again, what is needed is:
1. an aligned allocator -> we can use the one from Steven Johnson, used in fftw, which supports more or less the same archs as numpy
2. refactoring the array creation functions in C such that the implementation takes one additional alignment argument, while the original functions are kept identical to before
3. a few utility functions to check whether a buffer is SSE aligned, arbitrarily aligned, etc.

The only non-trivial point is 2. Actually, when I first thought about it, I thought about fixing the alignment at compile time, which would have made it totally avoidable: it would have been a simple change of the definition of PyDataMem_New to an aligned malloc with a constant. I already have the code for this, and besides the aligned malloc code, it is like a 5-line change of numpy code, nothing terrible, really.

David
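[The alignment-check utilities David lists are trivial at the Python level, whatever happens on the C side. A sketch — the names and the 16-byte constant are assumptions for illustration:]

```python
import numpy as np

SIMD_ALIGNMENT = 16  # SSE/AltiVec boundary; other archs may differ

def is_aligned(arr, alignment=SIMD_ALIGNMENT):
    # An array qualifies when its data pointer sits on the requested boundary.
    return arr.ctypes.data % alignment == 0

def is_simd_aligned(arr):
    return is_aligned(arr, SIMD_ALIGNMENT)
```

The C-level counterpart would be the same modulo test on the `data` pointer of the PyArrayObject.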
On 8/9/07, David Cournapeau <david@ar.media.kyoto-u.ac.jp> wrote:
Charles R Harris wrote:
Well, what you want might be very easy to do in python, we just need to check the default alignments for doubles and floats for some of the other compilers, architectures, and OS's out there. On the other hand, you might not be able to request a c malloc that is aligned in a portable way without resorting to the same tricks as you do in python. So why not use python and get the reference counting and garbage collection along with it?
First, doing it in python means that I cannot use the facility from C easily. But this is exactly where I need it, and where I would guess most people need it. People want to interface numpy with the mkl ? They will do it in C, right ? And maybe I am just too dumb to see the problem, but I don't see the need for garbage collection and so on :) Again, what is needed is: - aligned allocator -> we can use the one from Steven Johnson, used in fftw, which support more or less the same archs than numpy - Refactor the array creation functions in C such as the implementation takes one additional alignement argument, and the original functions are kept identical to before - Add a few utilities function to check whether it is SSE aligned, arbitrary aligned, etc...
The only non trivial point is 2 . Actually, when I first thought about it, I thought about fixing alignement at compile time, which would have made it totally avoidable: it would have been a simple change of the definition of PyDataMem_New to an aligned malloc with a constant. I have already the code for this, and besides aligned malloc code, it is like a 5 lines change of numpy code, nothing terrible, really.
Ah, you want it in C. Well, I think it would not be too difficult to change PyDataMem_New, however, the function signature would change and all the code that used it would break. That is pretty drastic. Better to define PyDataMem_New_Aligned, then redefine PyDataMem_New to use the new function. That way nothing breaks and you get the function you need. I don't think Travis would get upset if you added such a function and documented it. Chuck
On 8/9/07, David Cournapeau <david@ar.media.kyoto-u.ac.jp> wrote:
Charles R Harris wrote:
Ah, you want it in C.
What would be the use to get SIMD aligned arrays in python ?
If I wanted a fairly specialized routine and didn't want to touch the guts of numpy, I would pass the aligned array to a C function and use the data pointer. The python code would be just a high level wrapper. You might even be able to use ctypes to pass the pointer into a library function. It's not necessary to code everything in C using the python C API. Chuck
On 8/9/07, Charles R Harris <charlesr.harris@gmail.com> wrote:
On 8/9/07, David Cournapeau <david@ar.media.kyoto-u.ac.jp> wrote:
Charles R Harris wrote:
Ah, you want it in C.
What would be the use to get SIMD aligned arrays in python ?
If I wanted a fairly specialized routine and didn't want to touch the guts of numpy, I would pass the aligned array to a C function and use the data pointer. The python code would be just a high level wrapper. You might even be able to use ctypes to pass the pointer into a library function. It's not necessary to code everything in C using the python C API.
I certainly do not argue on this point. But if it were specialized, there would be no point putting it in numpy in the first place. What I hope is that at some point, the aligned allocators can be used inside core numpy to optimize things internally (ufunc, etc...). Those facilities would be really useful for many optimized libraries, which are all C: as such, doing it in C makes sense, no? David
On Thu, Aug 09, 2007 at 04:52:38PM +0900, David Cournapeau wrote:
Charles R Harris wrote:
Well, what you want might be very easy to do in python, we just need to check the default alignments for doubles and floats for some of the other compilers, architectures, and OS's out there. On the other hand, you might not be able to request a c malloc that is aligned in a portable way without resorting to the same tricks as you do in python. So why not use python and get the reference counting and garbage collection along with it?
First, doing it in python means that I cannot use the facility from C easily. But this is exactly where I need it, and where I would guess most people need it. People want to interface numpy with the mkl ? They will do it in C, right ?
It doesn't really matter where the memory allocation occurs, does it? As far as I understand, the underlying fftw function has some flag to indicate when the data is aligned. If so, we could expose that flag in Python, and do something like

x = align16(data)
_fft(x, is_aligned=True)

I am not intimately familiar with the fft wrappers, so maybe I'm missing something more fundamental.

Cheers
Stéfan
On 8/9/07, Stefan van der Walt <stefan@sun.ac.za> wrote:
It doesn't really matter where the memory allocation occurs, does it? As far as I understand, the underlying fftw function has some flag to indicate when the data is aligned. If so, we could expose that flag in Python, and do something like
x = align16(data)
_fft(x, is_aligned=True)
I am not intimately familiar with the fft wrappers, so maybe I'm missing something more fundamental.
You can do that, but this is only a special case of what I have in mind. For example, what if you want to call functions which are relatively cheap, but called many times, and want an aligned array? Going back and forth would be a huge waste. Also, having aligned buffers internally (in C), even for non-array data, can be useful (eg filters, and maybe even core numpy functionalities like ufunc, etc...). Another point I forgot to mention before is that we can define a default alignment which would already be SIMD friendly (as done on Mac OS X or FreeBSD by default malloc) for *all* numpy arrays at 0 cost: for fft, this means that most arrays would already be as wanted, meaning a huge boost of performance for free. Basically, the functionality would be more usable in C, without too much constraint, because frankly, the implementation is not difficult: I have something almost ready, and the patch is 7kb, including code to detect a platform-dependent aligned allocator. The C code can be tested really easily (since it is independent of python). David
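[The "going back and forth" cost David mentions is exactly what an align-or-pass-through helper avoids: copy only when the buffer is misaligned. A sketch, with `require_aligned` a hypothetical name:]

```python
import numpy as np

def require_aligned(a, alignment=16):
    # Pass the array through untouched when it is already usable;
    # otherwise copy it once into a freshly aligned buffer.
    if a.flags['C_CONTIGUOUS'] and a.ctypes.data % alignment == 0:
        return a
    raw = np.empty(a.nbytes + alignment, dtype=np.uint8)
    offset = (-raw.ctypes.data) % alignment
    out = raw[offset:offset + a.nbytes].view(a.dtype).reshape(a.shape)
    out[...] = a
    return out
```

With a default allocator that is already SIMD friendly, the fast pass-through branch would be taken almost every time, which is David's "huge boost for free" argument.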
On 8/9/07, Charles R Harris <charlesr.harris@gmail.com> wrote:
On 8/8/07, David Cournapeau <david@ar.media.kyoto-u.ac.jp> wrote:
Charles R Harris wrote:
Anne,
On 8/8/07, *Anne Archibald* <peridot.faceted@gmail.com> wrote:
I am not sure to understand which approach to which problem you are talking about here ?
IMHO, the discussion is becoming a bit carried away. What I was suggesting is - being able to check whether a given data buffer is aligned to a given alignment (easy) - being able to request an aligned data buffer: requires aligned memory allocators, and some additions to the API for creating arrays.
This all boils down to the following case: I have a C function which requires N bytes aligned data, I want the numpy API to provide this capability. I don't understand the discussion on doing it in python:
Well, what you want might be very easy to do in python, we just need to check the default alignments for doubles and floats for some of the other compilers, architectures, and OS's out there. On the other hand, you might not be able to request a c malloc that is aligned in a portable way without resorting to the same tricks as you do in python. So why not use python and get the reference counting and garbage collection along with it? What we want are doubles 8 byte aligned and floats 4 byte aligned. That seems to be the case with gcc, linux, and the Intel architecture. The idea is to create a slightly oversize array, then use a slice of the proper size that is 16 byte aligned.
Chuck
For instance, in the case of linux-x86 and linux-x86_64, the following should work:

In [68]: def align16(n, dtype=float64):
   ....:     size = dtype().dtype.itemsize
   ....:     over = 16/size
   ....:     data = empty(n + over, dtype=dtype)
   ....:     skip = (-data.ctypes.data % 16)/size
   ....:     return data[skip:skip + n]

Of course, now you need to fill in the data.

Chuck
For platforms without posix_memalign, I don't see how to implement a memory allocator with an arbitrary alignment (more precisely, I don't see how to free it if I cannot assume a fixed alignment: how do I know where the "real" pointer is?).
Visual Studio seems to offer a counterpart (also note that malloc is supposed to return a pointer on a 16-byte boundary) which is called _aligned_malloc ( http://msdn2.microsoft.com/en-us/library/8z34s9c6(VS.80).aspx). It should be what you need, at least for Windows/MSVC. Matthieu
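[The standard answer to David's "where is the real pointer?" question is to over-allocate, leave room for one stored pointer, and stash the original base address in the word just before the aligned block. The address arithmetic can be illustrated in Python (pure arithmetic, no actual allocation):]

```python
PTR_SIZE = 8  # assume 64-bit pointers

def aligned_offset(base_addr, alignment=16):
    # Skip past room for one stored pointer, then round up to the boundary.
    # An aligned_free() would read the original base address back from the
    # word at (aligned_addr - PTR_SIZE) and pass that to free().
    return (base_addr + PTR_SIZE + alignment - 1) & ~(alignment - 1)
```

This is roughly the scheme _aligned_malloc and fftw's own allocator use, which is why the alignment does not need to be known at free() time.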
On 8/6/07, Anne Archibald <peridot.faceted@gmail.com> wrote:
On 06/08/07, David Cournapeau <david@ar.media.kyoto-u.ac.jp> wrote:
Well, when I proposed the SIMD extension, I was willing to implement the proposal, and this was for a simple goal: enabling better integration with many numeric libraries which need SIMD alignment.
As nice as a custom allocator might be, I will certainly not implement it myself. For SIMD, I think the weight adding complexity / benefit worth it (since there is not much change to the API and implementation), and I know more or less how to do it; for custom allocator, that's an entirely different story. That's really more complex; static pools may be useful in some cases (but that's not obvious, since only the data are allocated with this buffer, everything else being allocated through the python memory allocator, and numpy arrays have pretty simple memory allocation patterns).
I have to agree. I can hardly volunteer David for anything, and I don't have time to implement this myself, but I think a custom allocator is a rather special-purpose tool; if one were to implement one, I think the way to go would be to implement a subclass of ndarray (or just a constructor) that allocated the memory. This could be done from python, since you can make an ndarray from scratch using a given memory array. Of course, making temporaries be allocated with the correct allocator will be very complicated, since it's unclear which allocator should be used.
Maybe I'm missing something, but handling the temporaries is automatic. Just return the appropriate slice from an array created in a subroutine. The original array gets its reference count decremented when the routine exits but the slice will still hold one. When the slice is deleted all the allocated memory will get garbage collected. Chuck
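[Chuck's point about the temporaries can be demonstrated directly: the slice returned from the subroutine holds a reference to the oversized parent through its `.base` attribute, so the allocation is freed only when the slice itself is collected. A minimal sketch, assuming float64 data and a 16-byte boundary:]

```python
import numpy as np

def make_aligned(n, alignment=16):
    # Allocate a slightly oversized array inside the subroutine and return
    # an aligned slice; the parent stays reachable through the slice.
    itemsize = np.dtype(np.float64).itemsize
    data = np.empty(n + alignment // itemsize, dtype=np.float64)
    skip = (-data.ctypes.data % alignment) // itemsize
    return data[skip:skip + n]

a = make_aligned(5)
# `a.base` is the oversized parent: its memory cannot be freed while `a` lives
```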
participants (10)
- Andrew Straw
- Anne Archibald
- Charles R Harris
- David Cournapeau
- David Cournapeau
- Lisandro Dalcin
- Matthew Brett
- Matthieu Brucher
- Stefan van der Walt
- Steven G. Johnson