Introductory mail and GSoC project "Vector math library integration"
Hi all, I am a third-year CS undergraduate student from an Indian institute (IIIT). I believe I am good at programming languages like C/C++ and Python, as I have already done some projects using these languages as part of my academics. I really like coding (competitive as well as development). I would really like to get involved in the NumPy development project and to take "Vector math library integration" as my project. I would like to hear any ideas from your side for this project. Thanks for taking the time to read this email and respond.
My IRC nickname: dp
Real name: Durgesh Pandey.
There are several vector math libraries NumPy could use, e.g. MKL/VML, Apple Accelerate (vecLib), ACML, and probably others. They all suffer from requiring dense arrays and specific array alignments, whereas NumPy arrays have very flexible strides and flexible alignment. NumPy also has ufuncs and gufuncs as a complicating factor.

There are at least two ways to proceed here. One is to only use vector math when strides and alignment allow it. The other is to build a vector math library specifically for NumPy arrays and (g)ufuncs; the latter you will most likely not be able to do in a summer. You should also consider Numba and Numexpr. They have some support for vector math libraries too.

Sturla

On 08/03/15 21:47, Dp Docs wrote:
> Hi all, I am a CS 3rd Undergrad. Student from an Indian Institute (IIIT). [...]
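Sturla's first option (use vector math only when strides and alignment allow it) amounts to a cheap runtime check on the array's layout. A minimal sketch, assuming a hypothetical 32-byte alignment requirement (real libraries document their own):

```python
import numpy as np

def can_use_vector_lib(a, alignment=32):
    """True if `a` is dense (C-contiguous) and its buffer starts on an
    `alignment`-byte boundary, so a vector math library could process
    it directly. The 32-byte default is illustrative only."""
    return bool(a.flags['C_CONTIGUOUS']) and a.ctypes.data % alignment == 0

dense = np.zeros(1024)
strided = dense[::2]  # every other element: a view, not contiguous

print(can_use_vector_lib(strided))             # False: strided view
print(can_use_vector_lib(dense, alignment=1))  # True: dense, trivial alignment
```

An array that fails the check would simply fall back to the current scalar loops.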
On Wed, Mar 11, 2015 at 7:52 PM, Sturla Molden <sturla.molden@gmail.com> wrote:
Hi Sturla, thanks for the suggestion.

> There are several vector math libraries NumPy could use, e.g. MKL/VML, Apple Accelerate (vecLib), ACML, and probably others.

Are these libraries fast enough in comparison to the C math library?

> They all suffer from requiring dense arrays and specific array alignments, whereas NumPy arrays have very flexible strides and flexible alignment. NumPy also has ufuncs and gufuncs as a complicating factor.

I don't think the project is supposed to modify the existing functionality: whenever the faster libraries are unavailable, it should use the default ones.

> There are at least two ways to proceed here. One is to only use vector math when strides and alignment allow it.

I didn't get it. Can you explain in detail?

> The other is to build a vector math library specifically for NumPy arrays and (g)ufuncs. The latter you will most likely not be able to do in a summer.

I had also come up with this approach, but I am a bit confused about it.

> You should also consider Numba and Numexpr. They have some support for vector math libraries too.

I will look into this. I think the actual problem is not choosing which library to integrate, but how to integrate these libraries. From what I have seen of the code base (and been told), the current implementation uses the C math library. Can we keep the current implementation and, wherever it calls a C math function, replace that call with the faster library's function? Then we would have to modify NumPy so that it first tries the faster library and looks for the default one only if the faster one is not available.

I have another doubt as well: are we supposed to integrate just one fast library, or more than one, so that if the first is unavailable we look for the second, then a third, and only then fall back to the default? Or should we think per operation: say "exp" is faster in the SLEEF library, so integrate SLEEF for that operation, while "sin" is faster in some other library, so integrate that library for "sin"? Different operations may well be faster in different libraries, so should the implementation be operation-oriented, or should we just integrate one complete library?

Thanks,
Durgesh Pandey, IIIT-Hyderabad, India.
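The fallback scheme described above (use the fast function when present, otherwise the default) can be sketched at the Python level with a dispatch table. This is purely illustrative: `_fast_ops` stands in for bindings to a real vector library, and NumPy's actual ufunc machinery works at the C level:

```python
import numpy as np

# Registry of fast implementations; in a real build this would be
# filled with vector-library bindings (MKL/VML, Yeppp, SLEEF, ...).
# It is left empty here to demonstrate the fallback path.
_fast_ops = {}

def dispatch(name, x):
    """Call the registered fast implementation of `name` if there is
    one, otherwise fall back to NumPy's default ufunc."""
    impl = _fast_ops.get(name, getattr(np, name))
    return impl(x)

x = np.array([0.0, 1.0])
print(dispatch("exp", x))  # no fast "exp" registered: falls back to np.exp
```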
On 11 March 2015 at 16:51, Dp Docs <sdpan21@gmail.com> wrote:
> > There are several vector math libraries NumPy could use, e.g. MKL/VML, Apple Accelerate (vecLib), ACML, and probably others.
>
> Are these libraries fast enough in comparison to C maths libraries?
These are the fastest beasts out there, written in C, assembly, and arcane incantations.
> > There are at least two ways to proceed here. One is to only use vector math when strides and alignment allow it.
>
> I didn't get it. Can you explain in detail?
One example: you can create a NumPy 2D array using only the odd columns of a matrix.

odd_matrix = full_matrix[::2, ::2]

This is just a view of the original data, so you save the time and the memory of making a copy. The drawback is that you trash memory locality, as the elements are not contiguous in memory. If the memory is guaranteed to be contiguous, a compiler can apply extra optimisations, and this is what vector libraries usually assume. What I think Sturla is suggesting with "when strides and alignment allow it" is to use the fast version if the array is contiguous, and fall back to the present implementation otherwise. Another option would be to make an optimally aligned copy, but that could eat up whatever time we save by using the faster library, among other problems. The difficulty with NumPy's strides is that they allow so many ways of manipulating the data (alternating elements, transpositions, different precisions...).
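The view described above can be inspected directly through NumPy's flags and strides (for float64, one element is 8 bytes):

```python
import numpy as np

full_matrix = np.arange(16.0).reshape(4, 4)

# A view of the odd rows/columns: no data is copied, but the
# elements are no longer adjacent in memory.
odd_matrix = full_matrix[::2, ::2]

print(full_matrix.flags['C_CONTIGUOUS'])  # True
print(odd_matrix.flags['C_CONTIGUOUS'])   # False
print(full_matrix.strides)  # (32, 8): next row is 4 doubles away, next column 1
print(odd_matrix.strides)   # (64, 16): the view skips every other element

# np.ascontiguousarray makes the dense copy a vector library would
# need, at the cost of the copy itself.
dense_copy = np.ascontiguousarray(odd_matrix)
print(dense_copy.flags['C_CONTIGUOUS'])   # True
```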
> I think the actual problem is not "to choose which library to integrate", it is how to integrate these libraries. As I have seen from the code base (and been told), the current implementation uses the C math library. Can we just use the current implementation and, whenever it calls a C math function, replace that call with the faster library's function? Then we have to modify NumPy so that it first tries the faster library and, if it is not available, looks for the default one.
At the moment, we are linking to whichever LAPACK is available at compile time, so no need for a runtime check. I guess it could (should?) be the same.
> Moreover, I have another doubt: are we supposed to integrate just one fast library, or more than one, so that if one is not available we look for the second, and so on, falling back to the default? Or should we think per operation: say "exp" is faster in SLEEF, so integrate SLEEF for that operation, while "sin" is faster in another library, so integrate that one for "sin"? Different operations may be faster in different libraries, so should the implementation be operation-oriented, or should we just integrate one complete library?
Which one is faster depends on the hardware, the version of the library, and even the size of the problem: http://s3.postimg.org/wz0eis1o3/single.png

I don't think you can reliably decide ahead of time which one should go for each operation. But, on the other hand, whichever one you go for will probably be fast enough for anyone using Python. Most of the work here is adapting NumPy's machinery to dispatch a call to the vector library; once that is ready, adding another one will hopefully be easier. At least, at the moment NumPy can use one of several linear algebra packages (MKL, ATLAS, CBLAS...) and they are added, I think, without too much pain (but maybe I am just far away from the screams of whoever did it).

/David.
On Thu, Mar 12, 2015 at 2:01 AM, Daπid <davidmenhur@gmail.com> wrote:

> At the moment, we are linking to whichever LAPACK is available at compile time, so no need for a runtime check. I guess it could (should?) be the same.

I didn't understand this. Let's say I have chosen one faster library; now I need to integrate it in *some way*, without changing the default functionality, so that when NumPy is imported with "from numpy import *" it can access the integrated library's functions as well as the default ones. What should that *some way* be? Even at compile time, it needs to decide which function it is going to use, right?

The integration of the MKL libraries has been discussed above, but when MKL is not available on the hardware architecture, will the above library serve as the default library? If yes, then the integration method discussed above may be the one required for this project, right? Can you please tell me a bit more, or provide some link related to that?

Does the availability of these faster libraries depend on the hardware architecture, or on the availability of hardware resources in a system? If it is the latter, the newly integrated library would support operations only some of the time. I believe it is the former, but it is better to clear up any confusion. For example, suppose availability meant free hardware resources: if library A needs resource A1 and A1 is busy, A cannot support the operation, while library B, which needs resource B1 that happens to be free, can. Availability would then be unpredictable at the moment we want to perform an operation, and worse, if a resource were allocated to another process between compile time and run time, using these operations would become very problematic.

So this leads me to think that "availability" of a library means whether the hardware architecture supports it. Since there are many kinds of hardware architecture, and it is not necessarily the case that one library supports them all (though it may be), we would need to integrate more than one library to cover every kind of architecture (which, in the ideal case, would make this a very big project).

> Which one is faster depends on the hardware, the version of the library, and even the size of the problem: http://s3.postimg.org/wz0eis1o3/single.png I don't think you can reliably decide ahead of time which one should go for each operation. [...]

So we are supposed to integrate just one of these libraries (the rest will use the default if unsupported)? MKL seems good, but as discussed above it is non-free (and it has already been integrated); can you suggest any other library that at least approximates MKL well? Eigen seems good, but it appears to be worse in the middle ranges. Can you provide any link with comparative information about all the available free vector libraries?

Thanks and regards,
Durgesh Pandey, IIIT-Hyderabad, India.
On 11/03/15 23:20, Dp Docs wrote:
> So we are supposed to integrate just one of these libraries?

As a Mac user I would be annoyed if we only supported MKL and not the Accelerate Framework. AMD LibM should be supported too. MKL is non-free, but we use it for BLAS and LAPACK; AMD LibM is non-free in a similar manner. The Accelerate Framework (vecLib) is part of Apple's operating systems. You can abstract out the differences.

Eigen is C++. We do not use C++ in NumPy, only C, Python and some Cython. C++ and Fortran can be used in SciPy, but not in NumPy.

Apple's reference is here: https://developer.apple.com/library/mac/documentation/Performance/Conceptual...
AMD's information is here: http://developer.amd.com/tools-and-sdks/cpu-development/libm/ (vector math functions seem to have moved from ACML to LibM)
Intel's reference is here: https://software.intel.com/sites/products/documentation/doclib/iss/2013/mkl/...

Sturla
On Wed, Mar 11, 2015 at 11:20 PM, Dp Docs <sdpan21@gmail.com> wrote:

> I didn't understand this. Let's say I have chosen one faster library; now I need to integrate it in some way, without changing the default functionality, so that NumPy can access the integrated library's functions as well as the default ones. What should that "some way" be? Even at compile time, it needs to decide which function it is going to use, right?

Indeed, it should probably work similarly to how BLAS/LAPACK functions are treated now. So you can support multiple libraries in NumPy (pick only one to start with, of course), but at compile time you'd pick the one to use. Then that library always gets called under the hood, i.e. no new public functions/objects in NumPy, only improved performance of existing ones.

> Does the availability of these faster libraries depend on the hardware architecture, or on the availability of hardware resources in a system? [...]

Not hardware resources, I'd think. Looking at http://www.yeppp.info, it supports all commonly used CPUs/instruction sets. As long as the accuracy of the library is OK, this should not be noticeable to users except for the difference in performance.

> So we are supposed to integrate just one of these libraries (the rest will use the default if unsupported)? MKL seems good, but it is non-free; can you suggest any other library that at least approximates MKL well? [...]

The idea on the GSoC page suggests http://www.yeppp.info/ or SLEEF (http://shibatch.sourceforge.net/). Based on those websites I'm 99.9% sure that Yeppp is a better bet. At least its benchmarks say that it's faster than MKL. As for the project, Julian (who'd likely be the main mentor) has already indicated, when suggesting the idea, that he has no interest in a non-free library: http://comments.gmane.org/gmane.comp.python.numeric.general/56933. So Yeppp plus the build architecture to support multiple libraries later on would probably be a good target.

Cheers,
Ralf
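The compile-time selection Ralf refers to is the same approach NumPy already uses for BLAS/LAPACK; which backend a given build linked against can be inspected at runtime:

```python
import numpy as np

# Prints the build-time configuration, including which BLAS/LAPACK
# implementation this NumPy was compiled against. A vector math
# backend chosen at compile time could be reported the same way.
np.__config__.show()
```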
Thanks to all of you for such a nice discussion and for the suggestions. I think most of my doubts have been resolved. If anything more comes up, I will let you know. Thanks again. -- Durgesh Pandey, IIIT-Hyderabad, India.
Ralf Gommers wrote:

> The idea on the GSoC page suggests http://www.yeppp.info/ or SLEEF (http://shibatch.sourceforge.net/). Based on those websites I'm 99.9% sure that Yeppp is a better bet. At least its benchmarks say that it's faster than MKL. [...] So Yeppp plus the build architecture to support multiple libraries later on would probably be a good target.
Thanks for the tip about Yeppp. While it looks interesting, it seems to be pretty limited: just a few transcendental functions. I didn't notice complex support either (e.g., dot product).

-- Those who fail to understand recursion are doomed to repeat it
On 08.03.2015 at 21:47, Dp Docs <sdpan21@gmail.com> wrote:
> Hi all, I am a CS 3rd Undergrad. Student from an Indian Institute (IIIT). [...]
On the scipy mailing list I also answered Amine, who is also interested in this proposal. A long time ago I wrote a package that provides fast math functions (ufuncs) for NumPy, using Intel's MKL/VML library; see https://github.com/geggo/uvml and my comments there. This code could easily be ported to use other vector math libraries, and it would be interesting to evaluate other possibilities. Because MKL is non-free, there are concerns about using it with NumPy, although NumPy and SciPy builds that use the MKL LAPACK routines are in frequent use (Anaconda or Christoph Gohlke's binaries). You can easily inject the fast math ufuncs into NumPy, e.g. with set_numeric_ops() or np.sin = vml.sin.

Gregor
On Wed, Mar 11, 2015 at 10:34 PM, Gregor Thalhammer < gregor.thalhammer@gmail.com> wrote:
> On the scipy mailing list I also answered to Amine, who is also interested in this proposal.

Can you provide the link to that discussion? I am having trouble finding it.

> Long time ago I wrote a package that provides fast math functions (ufuncs) for numpy, using Intel's MKL/VML library, see https://github.com/geggo/uvml and my comments there. This code could be easily ported to use other vector math libraries.

When MKL is not available on a system, will this integration fall back to the default numpy math functions?

> Would be interesting to evaluate other possibilities. Due to the fact that MKL is non-free, there are concerns to use it with numpy, although e.g. numpy and scipy using the MKL LAPACK routines are used frequently (Anaconda or Christoph Gohlkes binaries). You can easily inject the fast math ufuncs into numpy, e.g. with set_numeric_ops() or np.sin = vml.sin.

Can you explain in a bit more detail or provide a link where I can see it? Thanks for your valuable suggestion.

--
Durgesh Pandey, IIIT-Hyderabad, India.
On 11.03.2015 at 23:18, Dp Docs <sdpan21@gmail.com> wrote:
>> You can easily inject the fast math ufuncs into numpy, e.g. with set_numeric_ops() or np.sin = vml.sin.
> Can you explain in a bit more detail or provide a link where I can see it?
My approach for https://github.com/geggo/uvml was to provide a separate python extension that provides faster numpy ufuncs for math operations like exp, sin, cos, etc. To replace the standard numpy ufuncs by the optimized ones you don't need to change the source code of numpy; instead, at runtime you monkey-patch it and get faster math everywhere. Numpy even offers an interface (set_numeric_ops) to modify it at runtime.

Another note: numpy makes it easy to provide new ufuncs, see http://docs.scipy.org/doc/numpy-dev/user/c-info.ufunc-tutorial.html, from a C function that operates on 1D arrays, but this function needs to support arbitrary spacing (stride) between the items. Unfortunately, to achieve good performance, vector math libraries often expect that the items are laid out contiguously in memory; MKL/VML is a notable exception. So for non-contiguous in- or output arrays you might need to copy the data to a buffer, which likely kills a large part of the performance gain.

This does not completely rule out some of the libraries, since performance-critical data is likely to be stored in contiguous arrays. Using a library that supports vector math only for contiguous arrays is more difficult, but perhaps the numpy nditer provides everything needed.

Gregor
On 03/12/2015 10:15 AM, Gregor Thalhammer wrote:
> Another note, numpy makes it easy to provide new ufuncs, see http://docs.scipy.org/doc/numpy-dev/user/c-info.ufunc-tutorial.html from a C function that operates on 1D arrays, but this function needs to support arbitrary spacing (stride) between the items. Unfortunately, to achieve good performance, vector math libraries often expect that the items are laid out contiguously in memory. MKL/VML is a notable exception. So for non contiguous in- or output arrays you might need to copy the data to a buffer, which likely kills large amounts of the performance gain.
The elementary functions are very slow even compared to memory access; they take on the order of hundreds to tens of thousands of cycles to complete (depending on range and required accuracy). Even in the case of strided access, that gives the hardware prefetchers plenty of time to load the data before the previous computation is done. This also removes the requirement for the library to provide a strided API: we can copy the strided data into a contiguous buffer and pass it to the library without losing much performance. It may not be optimal (e.g. a library can fine-tune the prefetching better for the case where the hardware is not ideal) but it is most likely sufficient. Figuring out how best to do this, getting the best performance while staying flexible in which implementation is used, is part of the challenge the student will face in this project.
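The copy-to-buffer strategy Julian describes can be sketched as a thin wrapper. Here `contiguous_exp` is a hypothetical stand-in for a library routine that only accepts contiguous 1-D data (stubbed with numpy's own exp so the sketch runs):

```python
import numpy as np

def contiguous_exp(buf, out):
    # Stand-in for a vector math library call that requires contiguous
    # input; a real implementation would call into MKL/VML, Yeppp, etc.
    np.exp(buf, out=out)

def buffered_exp(a, block=4096):
    """Apply a contiguous-only kernel to an arbitrarily strided 1-D array
    by copying cache-sized blocks into a reusable scratch buffer."""
    a = np.asarray(a)
    out = np.empty(a.shape, dtype=a.dtype)
    buf = np.empty(block, dtype=a.dtype)     # scratch buffer, reused per block
    for start in range(0, a.size, block):
        chunk = a[start:start + block]       # may be strided
        n = chunk.size
        buf[:n] = chunk                      # the copy makes it contiguous
        contiguous_exp(buf[:n], out[start:start + n])
    return out

strided = np.arange(20.0)[::2]               # a non-contiguous view
```

A (g)ufunc inner loop would receive the stride directly instead of slicing, but the buffering idea is the same.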
On 12.03.2015 at 13:48, Julian Taylor <jtaylor.debian@googlemail.com> wrote:
> The elementary functions are very slow even compared to memory access, they take in the orders of hundreds to tens of thousand cycles to complete (depending on range and required accuracy). Even in the case of strided access that gives the hardware prefetchers plenty of time to load the data before the previous computation is done.
That might apply to the mathematical functions from the standard libraries, but it is not true for the optimized libraries. Typical numbers are 4-10 CPU cycles per operation, see e.g. https://software.intel.com/sites/products/documentation/doclib/mkl_sa/112/vm... The benchmarks at https://github.com/geggo/uvml show that access to main memory limits the performance of calculating exp for large array sizes. This test was done quite some time ago; memory bandwidth is typically higher now, but so is computational power.
> This also removes the requirement from the library to provide a strided api, we can copy the strided data into a contiguous buffer and pass it to the library without losing much performance. It may not be optimal (e.g. a library can fine tune the prefetching better for the case where the hardware is not ideal) but most likely sufficient.
Copying the data to a buffer small enough to fit into cache might add only a few cycles per item, but even that can impact performance significantly. Curious to see how much.

Gregor
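Gregor's "curious to see how much" can be probed with a small micro-benchmark comparing exp on a contiguous array, a strided view, and a strided view copied to contiguous memory first. Absolute numbers depend entirely on the machine, so no results are claimed here; this only shows how one would measure:

```python
import timeit
import numpy as np

n = 200_000
contig = np.random.rand(n)
strided = np.random.rand(2 * n)[::2]   # same element count, stride of 2 doubles

# Time each variant; np.ascontiguousarray performs the buffer copy.
t_contig = timeit.timeit(lambda: np.exp(contig), number=10)
t_strided = timeit.timeit(lambda: np.exp(strided), number=10)
t_copy = timeit.timeit(lambda: np.exp(np.ascontiguousarray(strided)), number=10)

print(f"contiguous:      {t_contig:.4f} s")
print(f"strided:         {t_strided:.4f} s")
print(f"copy+contiguous: {t_copy:.4f} s")
```

The gap between the strided and copy+contiguous timings is an upper bound on what a copying wrapper around a contiguous-only library would cost for this function.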
Figuring out how best to do this, getting the best performance while staying flexible in which implementation is used, is part of the challenge the student will face in this project.

_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
As Sturla and Gregor suggested, there are quite a few attempts to solve this shortcoming in NumPy. In particular, Gregor integrated MKL/VML support in numexpr quite a long time ago, and when combined with my own implementation of pooled threads (which behaves better than Intel's implementation in VML), the thing literally flies: https://github.com/pydata/numexpr/wiki/NumexprMKL

numba is another interesting option, and it shows much better compile times than the integrated compiler in numexpr. You can see a quick comparison of expected performance between numexpr and numba here: http://nbviewer.ipython.org/gist/anonymous/4117896 In general, numba wins for small arrays, but numexpr can achieve very good performance for larger ones. I think there are interesting things to discover in both projects, for example how they manage memory to avoid temporaries, or how they deal with unaligned data efficiently. I would advise looking at the existing docs and presentations explaining things in more detail too.

All in all, I would really love to see such vector math library support integrated in NumPy because, frankly, I don't have the bandwidth to maintain numexpr anymore (and I am afraid that nobody else would jump in this ship ;).

Good luck!
Francesc
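The temporaries Francesc mentions are easy to see with plain numpy: each intermediate in a compound expression allocates a full-size array, which numexpr's block-wise evaluation avoids. A minimal illustration, using explicit `out=` buffers as a rough stand-in for what numexpr does internally:

```python
import numpy as np

a = np.random.rand(100_000)
b = np.random.rand(100_000)
c = np.random.rand(100_000)

# Naive evaluation: "a * b + c" allocates a temporary for a*b,
# then another full array for the final sum.
naive = a * b + c

# Reusing a single output buffer avoids the extra allocation, roughly
# what numexpr's virtual machine achieves (it also works in cache-sized
# blocks, which this sketch does not show).
out = np.empty_like(a)
np.multiply(a, b, out=out)
np.add(out, c, out=out)
```

Both paths produce the same values; the difference is in memory traffic and allocations, which dominate for large arrays.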
-- Francesc Alted
participants (8)
- Daπid
- Dp Docs
- Francesc Alted
- Gregor Thalhammer
- Julian Taylor
- Neal Becker
- Ralf Gommers
- Sturla Molden