I've written a few simple Cython routines for assisting in creating byte-aligned numpy arrays, the point being for the arrays to work with SSE/AVX code.

https://github.com/hgomersall/pyFFTW/blob/master/pyfftw/utils.pxi

The recent change has been to add a check on the CPU as to what flags are supported (though it's not complete; I should make the default return 0 or something).

It occurred to me that this is something that (a) other people almost certainly need and are solving themselves, and (b) I lack the necessary platforms to test all the possible CPU/OS combinations to make sure something sensible happens in all cases.

Is this something that can be rolled into numpy (the feature, not my particular implementation or interface, though I'd be happy for it to be so)?

Regarding (b), I've written a test case that works for Linux on x86-64 with GCC (my platform!). I can test it on 32-bit Windows, but that's it. Is ARM supported by numpy? Neon would be great to include as well. What other platforms might need this?

Cheers,

Henry
On Wed, Dec 19, 2012 at 8:40 AM, Henry Gomersall <heng@cantab.net> wrote:
<snip>
Your code looks simple and portable to me (at least the alignment part). I can see a good argument for adding this sort of functionality directly to numpy with a nice interface, though, since these kinds of requirements seem quite common these days. Maybe an interface like:

    a = np.asarray([1, 2, 3], base_alignment=32)  # should this be in bits or in bytes?
    b = np.empty((10, 10), order="C", base_alignment=32)
    # etc.
    assert a.base_alignment == 32

which underneath tries to use posix_memalign/_aligned_malloc when possible, or falls back on the overallocation trick otherwise?

-n
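[Editorial note: the "overallocation trick" mentioned above can be sketched in pure numpy. This is a hypothetical helper, not part of numpy's API; `empty_aligned` and its signature are invented here for illustration.]

```python
import numpy as np

def empty_aligned(shape, dtype=np.float64, alignment=32):
    """Allocate an ndarray whose data pointer is aligned to `alignment`
    bytes, via over-allocation: grab `alignment` extra bytes, then slice
    in to the first address that is a multiple of `alignment`."""
    dtype = np.dtype(dtype)
    nbytes = int(np.prod(shape)) * dtype.itemsize
    # Over-allocate as raw bytes, then compute how far to shift forward.
    buf = np.empty(nbytes + alignment, dtype=np.uint8)
    offset = (-buf.ctypes.data) % alignment
    return buf[offset:offset + nbytes].view(dtype).reshape(shape)

a = empty_aligned((10, 10), alignment=32)
assert a.ctypes.data % 32 == 0
```

Note that the result keeps the over-allocated byte array alive as its base, which is exactly why an in-numpy implementation (using posix_memalign/_aligned_malloc directly) would be cleaner.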
On Wed, Dec 19, 2012 at 7:43 AM, Nathaniel Smith <njs@pobox.com> wrote:
<snip>
There is a thread about this from several years back. IIRC, David Cournapeau was interested in the same problem. At first glance, the alignment keyword looks interesting. One possible concern is keeping alignment for rows, views, etc., which is probably not possible in any sensible way. But people who need this most likely know what they are doing and just need memory allocated on the proper boundary. Chuck
On Wed, Dec 19, 2012 at 2:57 PM, Charles R Harris <charlesr.harris@gmail.com> wrote:
<snip>
Right, my intuition is that it's like order="C" -- if you make a new array by, say, indexing, then it may or may not have order="C", no guarantees. So when you care, you call asarray(a, order="C") and that either makes a copy or not as needed. Similarly for base alignment. I guess to push this analogy even further we could define a set of array flags, ALIGNED_8, ALIGNED_16, etc. (In practice only power-of-2 alignment matters, I think, so the number of flags would remain manageable?) That would make the C API easier to deal with too, no need to add PyArray_FromAnyAligned. -n
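[Editorial note: the ALIGNED_8/ALIGNED_16/... flag idea can be emulated today by probing the data pointer. `base_alignment` is a hypothetical helper name, invented here for illustration.]

```python
import numpy as np

def base_alignment(a, max_align=64):
    """Largest power of two (up to max_align) dividing the array's data
    pointer -- i.e. the alignment that hypothetical ALIGNED_8/ALIGNED_16/...
    flags would report for this array."""
    addr = a.ctypes.data
    align = 1
    while align < max_align and addr % (align * 2) == 0:
        align *= 2
    return align

x = np.empty(100)
print(base_alignment(x))  # typically 16 or more with glibc malloc on x86-64
```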
On Wed, Dec 19, 2012 at 8:10 AM, Nathaniel Smith <njs@pobox.com> wrote:
<snip>
Another possibility is an aligned datatype, basically an aligned structured array with floats/ints in chunks of the appropriate size. IIRC, gcc support for sse is something like that. Chuck
On Wed, Dec 19, 2012 at 3:27 PM, Charles R Harris <charlesr.harris@gmail.com> wrote:
<snip>
True; right now it looks like structured dtypes have no special alignment:

    In [13]: np.dtype("f4,f4").alignment
    Out[13]: 1

So for this approach we'd need a way to create structured dtypes with .alignment == .itemsize, and we'd need some way to request dtype-aligned memory from array allocation functions. I guess the existing NPY_ALIGNED is a good enough public interface for the latter, but AFAICT the current implementation just assumes that whatever malloc() returns will always be ALIGNED. This is true for all base C types, but not for more exotic record types with larger alignment requirements -- that would require some fancier allocation scheme.

Not sure which interface is more useful to users. On the one hand, using funny dtypes makes regular non-SIMD access more cumbersome, and it forces your array size to be a multiple of the SIMD word size, which might be inconvenient if your code is smart enough to handle arbitrary-sized arrays with partial SIMD acceleration (i.e., using SIMD for most of the array, and then a slow path to handle any partial word at the end). OTOH, if your code *is* that smart, you should probably just make it smart enough to handle a partial word at the beginning as well, and then you won't need any special alignment in the first place; and representing each SIMD word as a single numpy scalar is an intuitively appealing model of how SIMD works. OTOOH, just adding a single argument to np.array() is much simpler to explain than some elaborate scheme involving the creation of special custom dtypes.

-n
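[Editorial note: for comparison, numpy does have an `align=True` flag for structured dtypes, which gives C-struct-style field alignment. Even so, `.alignment` tracks the largest field, not `.itemsize`, so it still cannot express a SIMD-word-aligned dtype -- which is the gap identified above.]

```python
import numpy as np

# A packed struct dtype reports alignment 1, as in the session above:
packed = np.dtype("f4,f4")

# align=True requests C-struct field alignment instead, but .alignment
# then matches the widest field (4 bytes for f4), not .itemsize (8), so
# there is still no way to ask for "aligned to the whole record":
aligned = np.dtype("f4,f4", align=True)
print(packed.alignment, aligned.alignment, aligned.itemsize)
```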
On Wed, 2012-12-19 at 15:57 +0000, Nathaniel Smith wrote:
<snip>
If it helps, my use-case is in wrapping the FFTW library. This _is_ smart enough to deal with unaligned arrays, but it just results in a performance penalty. In the case of an FFT, there are clearly going to be issues with the power-of-two indices in the array not lying on a suitable n-byte boundary (which would be the case with a misaligned array), but I imagine it's not unique.

The other point is that it's easy to create a suitable power-of-two array that should always bypass any special-case unaligned code (e.g. with floats, any array length that is a multiple of 4 will fill every 16-byte word).

Finally, I think there is significant value in auto-aligning the array based on an appropriate inspection of the CPU capabilities (or alternatively, a function that reports back the appropriate SIMD alignment). Again, this makes it easier to wrap libraries that may function with any alignment, but benefit from optimum alignment.

Cheers,

Henry
On 12/19/12 5:47 PM, Henry Gomersall wrote:
<snip>
Hmm, NumPy seems to return data blocks that are aligned to 16 bytes on my systems (Linux and Mac OSX):

    In []: np.empty(1).data
    Out[]: <read-write buffer for 0x102b97b60, size 8, offset 0 at 0x102e7c130>

    In []: np.empty(1).data
    Out[]: <read-write buffer for 0x102ba64e0, size 8, offset 0 at 0x102e7c430>

    In []: np.empty(1).data
    Out[]: <read-write buffer for 0x102b86700, size 8, offset 0 at 0x102e7c5b0>

    In []: np.empty(1).data
    Out[]: <read-write buffer for 0x102b981d0, size 8, offset 0 at 0x102e7c5f0>

[Check that the last digit in the addresses above is always 0]

The only scenario I see where this would create unaligned arrays is for machines having AVX. But provided that the Intel architecture is making great strides in fetching unaligned data, I'd be surprised if the difference in performance were even noticeable.

Can you tell us what difference in performance you are seeing between an AVX-aligned array and one that is not AVX-aligned? Just curious.

-- Francesc Alted
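[Editorial note: rather than eyeballing the last hex digit in the buffer repr, the address can be read directly via `ctypes.data` and checked over many allocations. A small sketch:]

```python
import numpy as np

# Sample many small allocations and count how many data pointers land on
# a 16-byte boundary; the result depends entirely on the libc allocator.
addrs = [np.empty(1).ctypes.data for _ in range(1000)]
n16 = sum(addr % 16 == 0 for addr in addrs)
print(f"{n16}/{len(addrs)} allocations 16-byte aligned")
```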
On Wed, 2012-12-19 at 19:03 +0100, Francesc Alted wrote: <snip>
Hmm, NumPy seems to return data blocks that are aligned to 16 bytes on systems (Linux and Mac OSX): <snip>
That is not true at least under Windows 32-bit. I think also it's not true for Linux 32-bit from my vague recollections of testing in a virtual machine. (disclaimer: both those statements _may_ be out of date). But yes, under Linux 64-bit I always get my arrays aligned to 16 bytes.
The only scenario that I see that this would create unaligned arrays is for machines having AVX. But provided that the Intel architecture is making great strides in fetching unaligned data, I'd be surprised that the difference in performance would be even noticeable.
Can you tell us which difference in performance are you seeing for an AVX-aligned array and other that is not AVX-aligned? Just curious.
I don't know; I don't own a machine with AVX ;) It might be that the difference is negligible, though I do think it would be _nice_ to have the arrays properly aligned if it's not too difficult. Cheers, Henry
On Wed, Dec 19, 2012 at 6:25 PM, Henry Gomersall <heng@cantab.net> wrote:
<snip>
Currently numpy just uses whatever the system malloc() returns, so the alignment guarantees are entirely determined by your libc. -n
On 19.12.2012 19:25, Henry Gomersall wrote:
That is not true at least under Windows 32-bit. I think also it's not true for Linux 32-bit from my vague recollections of testing in a virtual machine. (disclaimer: both those statements _may_ be out of date).
malloc is required to return memory on a 16-byte boundary on Windows:

http://msdn.microsoft.com/en-us/library/ycsb6wwf.aspx

On Windows we can also use _aligned_malloc and _aligned_realloc to produce the alignment we want:

    _aligned_malloc(N, 4096); // page-aligned memory

Sturla
On Thu, 2012-12-20 at 17:48 +0100, Sturla Molden wrote:
<snip>
Except I build with MinGW. Please don't tell me I need to install Visual Studio... I have about 1GB free on my windows partition! hen
On Thu, 2012-12-20 at 20:50 +0100, Sturla Molden wrote:
On 20.12.2012 18:38, Henry Gomersall wrote:
Except I build with MinGW. Please don't tell me I need to install Visual Studio... I have about 1GB free on my windows partition!
The same DLL is used as CRT.
Perhaps the DLL should go and read MS's edicts! hen
On Thu, 2012-12-20 at 20:57 +0100, Sturla Molden wrote:
On 20.12.2012 20:52, Henry Gomersall wrote:
Perhaps the DLL should go and read MS's edicts!
Do you link with the same CRT as Python (msvcr90.dll)?
You should always use -lmsvcr90.
If you don't, you will link with msvcrt.dll.
Hmmm, plausibly not. Why is it important? (for my own understanding) Cheers, Henry
On 20.12.2012 21:03, Henry Gomersall wrote:
Why is it important? (for my own understanding)
Because if CRT resources are shared between different CRT versions, bad things will happen (the ABIs are not equivalent, errno and other globals are at different addresses, etc.). Cython code tends to share CRT resources with Python. For example, your Cython code might (invisibly to us) invoke malloc to allocate space for an object, and then Python will later call free. These should preferably go through the same DLL.

Another thing is that msvcrt.dll is for Windows' own system resources, not for user apps.

Sturla
On 20.12.2012 21:13, Sturla Molden wrote:
Because if CRT resources are shared between different CRT versions, bad things will happen (the ABIs are not equivalent, errno and other globals are at different addresses, etc.)
For example, PyErr_SetFromErrno will return garbage if CRTs are shared. Sturla
On Thu, 2012-12-20 at 21:13 +0100, Sturla Molden wrote:
<snip>
I understood the general point about it being bad, but it was more specific to Cython, which I think you answered. Presumably it's less of an issue with pure C libs where calls to libc are more obvious.
Another thing is that msvcrt.dll is for Windows own system resources, not for user apps.
I didn't know that. It's a real pain having so many libc libs knocking around. I have little experience of Windows, as you may have guessed! hen
On 20.12.2012 21:24, Henry Gomersall wrote:
I didn't know that. It's a real pain having so many libc libs knocking around. I have little experience of Windows, as you may have guessed!
Originally there was only one system-wide CRT on Windows (msvcrt.dll), which is why MinGW links with that by default. But starting with the release of VS2003, Microsoft decided to reserve msvcrt.dll for system resources, creating a libc "DLL Hell" for user apps. Visual Studio 2003 came with static and dynamic versions of the CRT library, as well as single- and multithreaded ones... Then everyone building apps that used DLLs or COM objects just had to make sure that nothing conflicted. And with every later version of Visual Studio they have released yet more CRT versions, adding to the confusion.

Currently: the official Python 2.7 binaries are built with Visual Studio 2008 and linked with msvcr90.dll.

MinGW has import libraries for the other CRTs Microsoft has released, so just add -lmsvcr90 to your final linkage.

Python's distutils will control the build process for extensions automatically. Adding -lmsvcr90 is one of the things that distutils will do.

Sturla
On Thu, 2012-12-20 at 21:45 +0100, Sturla Molden wrote:
<snip>
Well, I _am_ using distutils, so I should expect it to happen then. Probably my alignment concerns are based on some previous stuff when I was building pure C libs under Windows. Anyway, thanks for all the assistance! hen
On 20.12.2012 20:57, Sturla Molden wrote:
<snip>
Here is VS2008, which uses malloc from msvcr90.dll:

http://msdn.microsoft.com/en-us/library/ycsb6wwf(v=vs.90).aspx

"malloc is required to return memory on a 16-byte boundary"

If this does not happen, you are linking with the wrong CRT. When building C extensions for Python, we should always link with the same CRT as Python uses, unless you are 100% certain that CRT resources are never shared with Python.

Sturla
On Thu, 2012-12-20 at 21:05 +0100, Sturla Molden wrote:
<snip>
Is this something that can be established when using Cython? I haven't experienced any problems so far (other than a now solved problem trying to free memory allocated in FFTW which was down to a runtime mismatch). Cheers, Henry
On Wed, Dec 19, 2012 at 6:03 PM, Francesc Alted <francesc@continuum.io> wrote:
<snip>
Hmm, NumPy seems to return data blocks that are aligned to 16 bytes on systems (Linux and Mac OSX):
Only by accident, at least on linux. The pointers returned by the gnu libc malloc are at least 8 bytes aligned, but they may not be 16 bytes when you're above the threshold where mmap is used for malloc. The difference between aligned and unaligned ram <-> sse registers (e.g. movaps, movups) used to be significant. Don't know if that's still the case for recent CPUs. David
On Wed, 2012-12-19 at 19:03 +0100, Francesc Alted wrote:
<snip>
Further to this point, from an Intel article:

http://software.intel.com/en-us/articles/practical-intel-avx-optimization-on...

"Aligning data to vector length is always recommended. When using Intel SSE and Intel SSE2 instructions, loaded data should be aligned to 16 bytes. Similarly, to achieve best results use Intel AVX instructions on 32-byte vectors that are 32-byte aligned. The use of Intel AVX instructions on unaligned 32-byte vectors means that every second load will be across a cache-line split, since the cache line is 64 bytes. This doubles the cache line split rate compared to Intel SSE code that uses 16-byte vectors. A high cache-line split rate in memory-intensive code is extremely likely to cause performance degradation. For that reason, it is highly recommended to align the data to 32 bytes for use with Intel AVX."

Though it would be nice to put together a little example of this!

Henry
On 12/20/12 9:53 AM, Henry Gomersall wrote:
<snip>
Indeed, an example is what I was looking for. So provided that I have access to an AVX-capable machine (having 6 physical cores), and that MKL 10.3 has support for AVX, I have made some comparisons using the Anaconda Python distribution (it ships with most packages linked against MKL 10.3).

Here is a first example using a DGEMM operation. First, using a NumPy that is not turbo-loaded with MKL:

In [34]: a = np.linspace(0,1,1e7)
In [35]: b = a.reshape(1000, 10000)
In [36]: c = a.reshape(10000, 1000)
In [37]: time d = np.dot(b,c)
CPU times: user 7.56 s, sys: 0.03 s, total: 7.59 s
Wall time: 7.63 s
In [38]: time d = np.dot(c,b)
CPU times: user 78.52 s, sys: 0.18 s, total: 78.70 s
Wall time: 78.89 s

This is getting around 2.6 GFlop/s. Now, with an MKL 10.3 NumPy and AVX-unaligned data:

In [7]: p = ctypes.create_string_buffer(int(8e7)); hex(ctypes.addressof(p))
Out[7]: '0x7fcdef3b4010'  # 16-byte alignment
In [8]: a = np.ndarray(1e7, "f8", p)
In [9]: a[:] = np.linspace(0,1,1e7)
In [10]: b = a.reshape(1000, 10000)
In [11]: c = a.reshape(10000, 1000)
In [37]: %timeit d = np.dot(b,c)
10 loops, best of 3: 164 ms per loop
In [38]: %timeit d = np.dot(c,b)
1 loops, best of 3: 1.65 s per loop

That is around 120 GFlop/s (i.e. almost 50x faster than without MKL/AVX). Now, using MKL 10.3 and AVX-aligned data:

In [21]: p2 = ctypes.create_string_buffer(int(8e7+16)); hex(ctypes.addressof(p))
Out[21]: '0x7f8cb9598010'
In [22]: a2 = np.ndarray(1e7+2, "f8", p2)[2:]  # skip the first 16 bytes (now 32-byte aligned)
In [23]: a2[:] = np.linspace(0,1,1e7)
In [24]: b2 = a2.reshape(1000, 10000)
In [25]: c2 = a2.reshape(10000, 1000)
In [35]: %timeit d2 = np.dot(b2,c2)
10 loops, best of 3: 163 ms per loop
In [36]: %timeit d2 = np.dot(c2,b2)
1 loops, best of 3: 1.67 s per loop

So, again, around 120 GFlop/s, and the difference wrt unaligned AVX data is negligible. One may argue that DGEMM is CPU-bound and that memory access plays little role here, and that is certainly true.
So, let's go with a more memory-bound problem, like computing a transcendental function with numexpr. First with NumPy and numexpr with no MKL support:

In [8]: a = np.linspace(0,1,1e8)
In [9]: %time b = np.sin(a)
CPU times: user 1.20 s, sys: 0.22 s, total: 1.42 s
Wall time: 1.42 s
In [10]: import numexpr as ne
In [12]: %time b = ne.evaluate("sin(a)")
CPU times: user 1.42 s, sys: 0.27 s, total: 1.69 s
Wall time: 0.37 s

This is around 4x faster than regular 'sin' in libc, and about the same speed as a memcpy():

In [13]: %time c = a.copy()
CPU times: user 0.19 s, sys: 0.20 s, total: 0.39 s
Wall time: 0.39 s

Now, with an MKL-aware numexpr and non-AVX alignment:

In [8]: p = ctypes.create_string_buffer(int(8e8)); hex(ctypes.addressof(p))
Out[8]: '0x7fce435da010'  # 16-byte alignment
In [9]: a = np.ndarray(1e8, "f8", p)
In [10]: a[:] = np.linspace(0,1,1e8)
In [11]: %time b = ne.evaluate("sin(a)")
CPU times: user 0.44 s, sys: 0.27 s, total: 0.71 s
Wall time: 0.15 s

That is, more than 2x faster than a memcpy() on this system, meaning that the problem is truly memory-bound. So now, with an AVX-aligned buffer:

In [14]: a2 = a[2:]  # skip the first 16 bytes
In [15]: %time b = ne.evaluate("sin(a2)")
CPU times: user 0.40 s, sys: 0.28 s, total: 0.69 s
Wall time: 0.16 s

Again, times are very close. Just to make sure, let's use the timeit magic:

In [16]: %timeit b = ne.evaluate("sin(a)")
10 loops, best of 3: 159 ms per loop  # unaligned
In [17]: %timeit b = ne.evaluate("sin(a2)")
10 loops, best of 3: 154 ms per loop  # aligned

All in all, it is not clear that AVX alignment has an advantage, even for memory-bound problems. But of course, if Intel people are saying that AVX alignment is important, it is because they have use cases for asserting this. It is just that I'm having a difficult time finding these cases. -- Francesc Alted
On Thu, 2012-12-20 at 15:23 +0100, Francesc Alted wrote:
<snip>
Thanks for those examples, they were very interesting. I managed to temporarily get my hands on a machine with AVX, and I have shown some speed-up with aligned arrays.

FFT (using my wrappers) gives about a 15% speedup.

Also, this convolution code: https://github.com/hgomersall/SSE-convolution/blob/master/convolve.c shows a small but repeatable speed-up (a few %) when using some aligned loads (as many as I can work out to use!).

Cheers, Henry
On 12/20/12 7:35 PM, Henry Gomersall wrote:
<snip>
Okay, so a 15% is significant, yes. I'm still wondering why I did not get any speedup at all using MKL, but probably the reason is that it manages the unaligned corners of the datasets first, and then uses an aligned access for the rest of the data (but just guessing here). -- Francesc Alted
On Fri, 2012-12-21 at 11:34 +0100, Francesc Alted wrote:
<snip>
With SSE in the convolution code example above (in which all alignments need to be considered for each output element), I see a significant speedup from creating 4 copies of the float input array using memcpy, each shifted by 1 float (so the 5th element is aligned again). Despite all the extra copies, it's still quicker than using an unaligned load. However, when one tries the same trick with 8 copies for AVX, it's actually slower than the SSE case. The fastest AVX (and overall) implementation I have so far uses 16-byte-aligned arrays (made with 4 copies, as with SSE), with alternating aligned and unaligned loads (which is always at worst 16-byte aligned). Fascinating stuff! Henry
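The shifted-copies idea is perhaps easier to see in numpy than in C. A hedged sketch of just the data layout (illustrative only; the names `aligned_zeros_f32` and `shifted_copies` are mine, not from convolve.c): make one 16-byte-aligned copy per possible float32 phase, so that whatever offset the kernel needs, some copy presents that data starting on an aligned boundary:

```python
import numpy as np

def aligned_zeros_f32(n, boundary=16):
    """Over-allocate a uint8 buffer and slice so the returned
    float32 array starts on a `boundary`-byte address."""
    buf = np.zeros(n * 4 + boundary, dtype=np.uint8)
    start = -buf.__array_interface__['data'][0] % boundary
    return buf[start:start + n * 4].view(np.float32)

def shifted_copies(x, boundary=16):
    """One aligned copy per float32 phase within `boundary`: copy k holds
    x starting k floats into an aligned buffer, so element j of x sits at
    an aligned position in the copy where (k + j) is a multiple of 4."""
    phase = boundary // 4  # floats per 16-byte vector
    copies = []
    for k in range(phase):
        buf = aligned_zeros_f32(len(x) + phase, boundary)
        buf[k:k + len(x)] = x
        copies.append(buf)
    return copies
```

In the C code, the inner loop would then pick the copy whose phase matches the current output offset and use aligned loads throughout, at the cost of 4x the memory traffic up front.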
On 12/21/12 11:58 AM, Henry Gomersall wrote:
<snip>
Yes, to say the least. And it supports the fact that, when fine-tuning memory access performance, there is no substitute for experimentation (often in weird ways :) -- Francesc Alted
On 12/20/2012 03:23 PM, Francesc Alted wrote:
<snip>
Hmm, I think it is the opposite, that it is for CPU-bound problems that alignment would have an effect? I.e. the MOVUPD would be doing some shuffling etc. to get around the non-alignment, which only matters if the data is already in cache. (There are other instructions, like the STREAM instructions and the direct writes and so on, which are much more important for the non-cached case. At least that's my understanding.) Dag Sverre
On 12/21/12 1:35 PM, Dag Sverre Seljebotn wrote:
<snip>
Yes, I think you are right. It is just that I was a bit disappointed with the DGEMM not being affected by non-AVX alignment and tried a memory-bound problem, just in case. But as I said before, probably Intel people have dealt with both aligned and unaligned data. -- Francesc Alted
On Wed, 2012-12-19 at 15:10 +0000, Nathaniel Smith wrote: <snip>
<snip>
Your code looks simple and portable to me (at least the alignment part). I can see a good argument for adding this sort of functionality directly to numpy with a nice interface, though, since these kinds of requirements seem quite common these days. Maybe an interface like

a = np.asarray([1, 2, 3], base_alignment=32)  # should this be in bits or in bytes?
b = np.empty((10, 10), order="C", base_alignment=32)
# etc.
assert a.base_alignment == 32

which underneath tries to use posix_memalign/_aligned_malloc when possible, or falls back on the over-allocation trick otherwise?
There is a thread about this from several years back. IIRC, David Cournapeau was interested in the same problem. At first glance, the alignment keyword looks interesting. One possible concern is keeping alignment for rows, views, etc., which is probably not possible in any sensible way. But people who need this most likely know what they are doing and just need memory allocated on the proper boundary.
Right, my intuition is that it's like order="C" -- if you make a new array by, say, indexing, then it may or may not have order="C", no guarantees. So when you care, you call asarray(a, order="C") and that either makes a copy or not as needed. Similarly for base alignment.
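To make the view concern concrete, a tiny illustration (using `__array_interface__` to read the data pointer) of why a view can't inherit the base array's alignment guarantee:

```python
import numpy as np

a = np.zeros(100, dtype=np.float64)
addr = lambda v: v.__array_interface__['data'][0]

# A one-element slice shifts the data pointer by itemsize (8 bytes here),
# so `a` and `a[1:]` can never both be 32-byte aligned.
assert addr(a[1:]) - addr(a) == a.itemsize
assert not (addr(a) % 32 == 0 and addr(a[1:]) % 32 == 0)
```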
I guess to push this analogy even further we could define a set of array flags, ALIGNED_8, ALIGNED_16, etc. (In practice only power-of-2 alignment matters, I think, so the number of flags would remain manageable?) That would make the C API easier to deal with too, no need to add PyArray_FromAnyAligned.
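A sketch of how such flags could be computed from the data pointer (ALIGNED_8 and friends are hypothetical names from the paragraph above, not an existing numpy API):

```python
import numpy as np

def alignment_flags(a, boundaries=(8, 16, 32, 64)):
    """The set of power-of-two boundaries the data pointer satisfies --
    a stand-in for hypothetical ALIGNED_8 / ALIGNED_16 / ... array flags."""
    address = a.__array_interface__['data'][0]
    return {b for b in boundaries if address % b == 0}
```

A hypothetical `asarray(a, base_alignment=32)` could then copy only when 32 is missing from this set, analogous to how order="C" triggers a copy only for non-contiguous input.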
So, if I were to implement this, I presume the proper way would be through modifications to multiarray? Would this basic description be a reasonable initial target? Henry
On Thu, 2012-12-20 at 08:12 +0000, Henry Gomersall wrote:
<snip>
There is this patch: http://projects.scipy.org/numpy/attachment/ticket/568/aligned_v1.patch which, rather amusingly to me, was written by Steven Johnson of FFTW. It looks like a good starting point. Cheers, Henry
On Thu, Dec 20, 2012 at 8:12 AM, Henry Gomersall <heng@cantab.net> wrote:
<snip>
So, if I were to implement this, I presume the proper way would be through modifications to multiarray?
Yes, numpy/core/src/multiarray/ is the code you'd be modifying.
Would this basic description be a reasonable initial target?
My feeling is that we're at the stage where we need to get more information and feedback from people with experience in this area before we'll be able to merge anything into numpy proper (since that implies nailing down and committing to an API). One way to get that might be to go ahead and implement something to experiment with, and this "basic description" does seem like one of the plausible options, so... yes it seems like a reasonable initial target to work on, but I don't want to mislead you into thinking that I think it would necessarily be a reasonable initial target to ship in 1.8 or whatever. I feel like I don't have enough information to make a judgement there. There must be other people working with SIMD and numpy, right? If you're interested in this problem, another thing that might help would be to spend some effort finding those people and convincing them to get involved in discussing what they need. -n
On 19.12.2012 09:40, Henry Gomersall wrote:
I've written a few simple cython routines for assisting in creating byte-aligned numpy arrays. The point being for the arrays to work with SSE/AVX code.
https://github.com/hgomersall/pyFFTW/blob/master/pyfftw/utils.pxi
Why use Cython?

http://mail.scipy.org/pipermail/scipy-user/2009-March/020289.html

    def aligned_zeros(shape, boundary=16, dtype=float, order='C'):
        N = np.prod(shape)
        d = np.dtype(dtype)
        tmp = np.zeros(N * d.itemsize + boundary, dtype=np.uint8)
        address = tmp.__array_interface__['data'][0]
        offset = (boundary - address % boundary) % boundary
        return tmp[offset:offset+N]\
            .view(dtype=d)\
            .reshape(shape, order=order)

Sturla
On Thu, 2012-12-20 at 17:26 +0100, Sturla Molden wrote:
On 19.12.2012 09:40, Henry Gomersall wrote:
I've written a few simple cython routines for assisting in creating byte-aligned numpy arrays. The point being for the arrays to work with SSE/AVX code.
https://github.com/hgomersall/pyFFTW/blob/master/pyfftw/utils.pxi
Why use Cython?
http://mail.scipy.org/pipermail/scipy-user/2009-March/020289.html
    def aligned_zeros(shape, boundary=16, dtype=float, order='C'):
        N = np.prod(shape)
        d = np.dtype(dtype)
        tmp = np.zeros(N * d.itemsize + boundary, dtype=np.uint8)
        address = tmp.__array_interface__['data'][0]
        offset = (boundary - address % boundary) % boundary
        return tmp[offset:offset+N]\
            .view(dtype=d)\
            .reshape(shape, order=order)
Initially because it kept my module in a single file. That's legacy now, but since I'm already in the Cython domain, it makes sense to get the advantages (like speed: creating a 1000-length array with n_byte_align_empty is about 7 times faster than with the code above). The alignment function is just a utility for the FFTW wrapper. Cheers, Henry
On 20.12.2012 17:47, Henry Gomersall wrote:
On Thu, 2012-12-20 at 17:26 +0100, Sturla Molden wrote:
    return tmp[offset:offset+N]\
        .view(dtype=d)\
        .reshape(shape, order=order)
Also, just for the email record, that should be
    return tmp[offset:offset+N*d.itemsize]\
        .view(dtype=d)\
        .reshape(shape, order=order)
Oops, yes that's right :) Sturla
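For the record, folding that fix back into the earlier snippet gives the complete helper. This is a sketch of the pure-numpy overallocation approach discussed in the thread, not numpy's own allocator:

```python
import numpy as np

def aligned_zeros(shape, boundary=16, dtype=float, order='C'):
    """Allocate a zeroed array whose data starts on a `boundary`-byte address."""
    N = int(np.prod(shape))
    d = np.dtype(dtype)
    # Over-allocate by `boundary` bytes so an aligned start always exists.
    tmp = np.zeros(N * d.itemsize + boundary, dtype=np.uint8)
    address = tmp.__array_interface__['data'][0]
    offset = (boundary - address % boundary) % boundary
    # Slice in *bytes* (hence N * d.itemsize), then reinterpret and reshape.
    return (tmp[offset:offset + N * d.itemsize]
            .view(dtype=d)
            .reshape(shape, order=order))

a = aligned_zeros((10, 10), boundary=32)
assert a.__array_interface__['data'][0] % 32 == 0
```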
participants (7)
- Charles R Harris
- Dag Sverre Seljebotn
- David Cournapeau
- Francesc Alted
- Henry Gomersall
- Nathaniel Smith
- Sturla Molden