Neal Becker wrote:

I'm trying to do a bit of benchmarking to see if AMD libm/ACML will help me. I got the idea that, instead of building all of numpy/scipy and all of my custom modules against these libraries, I could simply use:

    LD_PRELOAD=/opt/amdlibm-3.0.2/lib/dynamic/libamdlibm.so:/opt/acml5.2.0/gfortran64/lib/libacml.so <my program here>

I'm hoping that both numpy and my own shared libraries will then take advantage of these libraries. Do you think this will work?
David Cournapeau wrote:

Quite unlikely, depending on your configuration, because those libraries are rarely if ever ABI-compatible (that's why this is such a pain to support).

David
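Whether the preload actually wins the symbol lookup can be checked directly with glibc's dynamic-linker debug output. A quick sketch, reusing the paths from the original message; the exact log format varies across glibc versions:

    LD_PRELOAD=/opt/amdlibm-3.0.2/lib/dynamic/libamdlibm.so:/opt/acml5.2.0/gfortran64/lib/libacml.so \
    LD_DEBUG=bindings python -c "import numpy" 2>&1 | grep -w exp
    # Each "binding ... symbol `exp'" line names the library that won the
    # lookup; libamdlibm.so should appear there if the preload took effect.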
Neal Becker wrote:

When you say quite unlikely (to work), do you mean

a) it is unlikely that libm/ACML will be used to resolve symbols (e.g., exp) in numpy and my shared libraries at runtime, or
b) the program may produce wrong results and/or crash?
David Cournapeau wrote:

Both, actually. That's not something I would use myself. Did you try OpenBLAS? It is open source, simple to build, and pretty fast.

David
Neal Becker wrote:

Actually, for my current work I'm more concerned with speeding up operations such as exp, log, and basic vector arithmetic. Any thoughts on that?
Neal Becker wrote:

In my current work, probably the largest bottlenecks are 'max*' operations, which compute log(sum_i exp(x_i)).
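For this max* operation specifically, a numerically stable plain-numpy version is only a few lines. A sketch of the standard max-subtraction trick (recent scipy also ships this as scipy.special.logsumexp):

    import numpy as np

    def max_star(x):
        # log(sum(exp(x))) == m + log(sum(exp(x - m))) for m = max(x);
        # subtracting the max first keeps exp() from overflowing.
        m = x.max()
        return m + np.log(np.exp(x - m).sum())

    x = np.array([1000.0, 1000.5])   # the naive form overflows here
    print(max_star(x))               # ~1000.974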
Dag Sverre Seljebotn wrote:

numexpr with Intel VML is the only solution I know of that doesn't require you to dig into compiling C code yourself. Did you look into that, or is using Intel VML/MKL not an option?

Fast exps depend on the CPU evaluating many exps at the same time (both explicitly through vector registers and implicitly through pipelining). Even if you get what you're trying to do to work (which I think is unlikely), the approach is inherently slow: passing a single number at a time through the exp function can't be efficient.

Dag Sverre
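A minimal sketch of the max* expression with numexpr (it uses VML for exp() transparently when built against MKL):

    import numpy as np
    import numexpr as ne

    x = np.random.standard_normal(1000000)
    m = x.max()
    # The whole expression is compiled once and evaluated block by block
    # over x, without materializing a full-size exp(x - m) temporary.
    result = m + np.log(ne.evaluate("sum(exp(x - m))"))
    print(result)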
Neal Becker wrote:

Would you expect numexpr without MKL to give a significant boost?
Chris Barker wrote:

It can, depending on the use case:

-- It can remove a lot of unnecessary temporary creation.
-- IIUC, it works on blocks of data at a time, and thus can keep things in cache more when working with large data sets.
-- It can (optionally) use multiple threads for easy parallelization.

All you can do is try it on your use case and see what you get. It's a pretty light lift to try.

-Chris
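A sketch of such a try-it-on-your-use-case comparison; timings will of course depend on hardware, array size, and thread count:

    import timeit
    import numpy as np
    import numexpr as ne

    x = np.random.standard_normal(5000000)

    t_np = timeit.timeit(lambda: np.exp(x) / (1 + np.exp(x)), number=10)
    t_ne = timeit.timeit(lambda: ne.evaluate("exp(x) / (1 + exp(x))"),
                         number=10)
    print("numpy:   %.3f s" % t_np)
    print("numexpr: %.3f s" % t_ne)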
Francesc Alted wrote:

Yes. Have a look at how numexpr's own multi-threaded virtual machine compares with numexpr using VML: http://code.google.com/p/numexpr/wiki/NumexprVML

As can be seen there, the best results are obtained by using the multi-threaded VM in numexpr in combination with a single-threaded VML engine. Caution: I did these benchmarks some time ago (a couple of years?), so multi-threaded VML may have improved by now. If performance is critical, some experiments should be done first to find the optimal configuration.

At any rate, VML will let you optimally leverage the SIMD instructions in the cores, computing, for example, exp() in 1 or 2 clock cycles per element (depending on the vector length, the number of cores in your system, and the data precision): http://software.intel.com/sites/products/documentation/hpc/mkl/vml/functions...

Pretty amazing.

-- Francesc Alted
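Those configuration experiments can be scripted through numexpr's public knobs. A sketch; set_vml_num_threads is only meaningful when numexpr was built against VML/MKL:

    import numexpr as ne

    print(ne.use_vml)          # True only for a VML/MKL-enabled build
    ne.set_num_threads(4)      # threads for numexpr's own virtual machine
    if ne.use_vml:
        ne.set_vml_num_threads(1)   # single-threaded VML, per the above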
Francesc Alted wrote:

> -- IIUC, it works on blocks of data at a time, and thus can keep things in cache more when working with large data sets.
Well, the temporaries are still created, but the point is that, by working with small blocks at a time, those temporaries fit in CPU cache, preventing copies into main memory. I like to call this the 'blocking technique', as explained in slide 26 (and following) of: https://python.g-node.org/wiki/_media/starving_cpu/starving-cpu.pdf

A better technique is to reduce the block size to the minimal expression (1 element), so that temporaries are stored in CPU registers instead of small blocks in cache, preventing copies even in *cache*. Numba (https://github.com/numba/numba) follows this approach, which is pretty optimal, as can be seen in slide 37 of the lecture above.
> -- It can (optionally) use multiple threads for easy parallelization.
No -- the *total* number of cores detected in the system is the default in numexpr; if you want fewer, you need to use the set_num_threads(nthreads) function. But agreed, sometimes using too many threads can be counterproductive.

-- Francesc Alted
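A sketch of the one-element-block idea using Numba's current @njit API (which postdates this thread): the whole max* computation is fused into compiled loops, so intermediates stay in registers rather than in temporary arrays.

    import numpy as np
    from numba import njit

    @njit
    def max_star(x):
        m = x[0]                 # running maximum lives in a register
        for v in x:
            if v > m:
                m = v
        s = 0.0                  # running sum lives in a register, too
        for v in x:
            s += np.exp(v - m)
        return m + np.log(s)

    x = np.random.standard_normal(1000000)
    print(max_star(x))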
Dag Sverre Seljebotn wrote:

If you need higher performance than numexpr can give without MKL, you could look at code such as this: https://github.com/herumi/fmath/blob/master/fmath.hpp#L480

But that means going to C (e.g., by wrapping that function in Cython). Pay attention to the range over which you evaluate the function, though (my eyes may deceive me, but it seems the test program only tests arguments drawn from the standard Gaussian, which is a bit limited).

Dag Sverre
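The range caveat is easy to check from Python for whatever fast exp you end up wrapping. A sketch, with fast_exp as a hypothetical stand-in for your wrapper:

    import numpy as np

    def check_fast_exp(fast_exp):
        # Sweep (almost) the whole double range where exp() neither
        # underflows nor overflows, not just standard-Gaussian draws.
        x = np.linspace(-700.0, 700.0, 100001)
        ref = np.exp(x)
        rel = np.abs(fast_exp(x) - ref) / np.maximum(ref, 1e-300)
        print("max relative error: %g" % rel.max())

    check_fast_exp(np.exp)   # trivially ~0; substitute your wrapper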
Chris Barker wrote:

Hmm -- I thought it was "smart" enough to remove some unnecessary temporaries altogether. Shows what I know. But apparently it does, indeed, avoid creating the full-size temporary arrays. Pretty cool stuff, in any case.

-Chris
Francesc Alted wrote:

Hey, that's cool. I was a bit disappointed not to have found this sort of work out in the open before. It seems to lack threading support, but that should be easy to add using OpenMP directives.

-- Francesc Alted
Dag Sverre Seljebotn wrote:

IMO this is the wrong place to introduce threading; each thread should call expd_v on its own chunks. (Which I think is how you said numexpr currently uses VML anyway.)

Dag Sverre
Francesc Alted wrote:

Oh sure, but then you need a blocked engine for performing the computations too. And yes, by default numexpr uses its own threading code rather than the threading in VML (but that can be changed by playing with set_num_threads/set_vml_num_threads). It has always struck me as a little strange that the internal threading in numexpr is more efficient than VML's, but I suppose this is because the latter is optimized for large blocks rather than the medium-sized ones (4K elements) that numexpr uses.

-- Francesc Alted
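For illustration, a toy blocked engine in plain numpy; the 4096-element block mirrors numexpr's default, and the reused out= buffer is what keeps the temporaries cache-resident:

    import numpy as np

    def blocked_eval(x, out, block=4096):
        # out = exp(x) / (1 + exp(x)), one cache-sized block at a time:
        # the only temporaries are block-sized, so they stay in cache.
        tmp = np.empty(block)
        for i in range(0, len(x), block):
            b = x[i:i + block]
            t = tmp[:len(b)]
            np.exp(b, out=t)
            out[i:i + block] = t / (1.0 + t)
        return out

    x = np.random.standard_normal(1000000)
    r = blocked_eval(x, np.empty_like(x))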
Dag Sverre Seljebotn wrote:

I just meant that you can use a chunked OpenMP for-loop wherever in your code you call expd_v. A "five-line blocked engine", if you like :-) IMO that's the right location, since entering/exiting OpenMP blocks takes some time.
I don't know enough about numexpr to understand this :-) I guess I just don't see the motivation to use VML's threading, or why it should be faster. If you pass a single 4K block to a threaded VML call, I can easily see lots of performance problems: a) starting/stopping threads, or signalling the threads of a pool, is a constant overhead per parallel section; b) unless you're very careful to have only VML touch the data, and VML always schedules elements in exactly the same way, you're going to have the cache lines of that 4K block shuffled between the L1 caches of different cores for different operations.

As I said, I'm mostly ignorant about how numexpr works; that's probably showing :-)

Dag Sverre
Dag Sverre Seljebotn wrote:

And c) your "effective block size" is then 4KB/ncores (unless you scale the block size by ncores).

DS
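Dag's "five-line blocked engine", sketched in Python with a thread pool standing in for OpenMP; numpy ufuncs release the GIL in their inner loops, so the chunks really do run in parallel, and np.exp stands in for expd_v here:

    import numpy as np
    from concurrent.futures import ThreadPoolExecutor

    def parallel_exp(x, out, nthreads=4):
        # Each thread handles its own contiguous chunk, so the cache
        # lines of a given block stay on one core.
        step = (len(x) + nthreads - 1) // nthreads
        def work(i):
            np.exp(x[i:i + step], out=out[i:i + step])
        with ThreadPoolExecutor(nthreads) as pool:
            list(pool.map(work, range(0, len(x), step)))
        return out

    x = np.random.standard_normal(4000000)
    r = parallel_exp(x, np.empty_like(x))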
Francesc Alted wrote:

> I just meant that you can use a chunked OpenMP for-loop wherever in your code you call expd_v. A "five-line blocked engine", if you like :-) IMO that's the right location, since entering/exiting OpenMP blocks takes some time.
Yes, that is precisely what I meant in the first place.
> As I said, I'm mostly ignorant about how numexpr works; that's probably showing :-)
No, on the contrary, you have rather hit the core of the issue (or part of it). On the one hand, VML needs large blocks in order to maximize pipeline performance; on the other hand, numexpr tries to minimize the block size in order to keep temporaries as small as possible (avoiding the higher-level caches). From this tension (and some benchmarking work) the 4K size was derived (btw, that is the number of *elements*, so the size is actually 16 KB or 32 KB for single and double precision, respectively). Incidentally, for numexpr without VML support the size is reduced to 1K elements (and perhaps it could be reduced a bit more).

Anyway, this is way too low-level to be discussed here, although we can continue on the numexpr list if you are interested in more details.

-- Francesc Alted
participants (5)

- Chris Barker
- Dag Sverre Seljebotn
- David Cournapeau
- Francesc Alted
- Neal Becker