[Numpy-discussion] testing with amd libm/acml

Dag Sverre Seljebotn d.s.seljebotn at astro.uio.no
Thu Nov 8 13:56:27 EST 2012


On 11/08/2012 07:55 PM, Dag Sverre Seljebotn wrote:
> On 11/08/2012 06:59 PM, Francesc Alted wrote:
>> On 11/8/12 6:38 PM, Dag Sverre Seljebotn wrote:
>>> On 11/08/2012 06:06 PM, Francesc Alted wrote:
>>>> On 11/8/12 1:41 PM, Dag Sverre Seljebotn wrote:
>>>>> On 11/07/2012 08:41 PM, Neal Becker wrote:
>>>>>> Would you expect numexpr without MKL to give a significant boost?
>>>>> If you need higher performance than what numexpr can give without
>>>>> using MKL, you could look at code such as this:
>>>>>
>>>>> https://github.com/herumi/fmath/blob/master/fmath.hpp#L480
>>>> Hey, that's cool.  I was a bit disappointed not to have found this
>>>> sort of work in the open before.  It seems to lack threading support,
>>>> but that should be easy to add with OpenMP directives.
>>> IMO this is the wrong place to introduce threading; each thread should
>>> call expd_v on its chunks. (Which I think is how you said numexpr
>>> currently uses VML anyway.)
>>
>> Oh sure, but then you need a blocked engine for performing the
>> computations too.  And yes, by default numexpr uses its own threading
>
> I just meant that you can use a chunked OpenMP for-loop wherever in your
> code you call expd_v (sketched at the end of this mail). A "five-line
> blocked engine", if you like :-)
>
> IMO that's the right location since entering/exiting OpenMP blocks takes
> some time.
>
>> code rather than the existing one in VML (but that can be changed by
>> playing with set_num_threads/set_vml_num_threads).  It has always
>> struck me as a little strange that the internal threading in numexpr
>> is more efficient than the VML one, but I suppose that is because the
>> latter is optimized for large blocks rather than the medium-sized
>> (4 KB) ones numexpr works with.
>
> I don't know enough about numexpr to understand this :-)
>
> I guess I just don't see the motivation for using VML threading, or why
> it should be faster. If you pass a single 4K block to a threaded VML
> call then I could easily see lots of performance problems: a)
> starting/stopping threads or signalling the threads of a pool is a
> constant overhead per "parallel section", b) unless you're very careful
> to only have VML touch the data, and VML always schedules elements in
> the exact same way, you're going to have the cache lines of that 4K
> block shuffled between the L1 caches of different cores for different
> operations...

c) Your "effective block size" is then 4KB/ncores: with 8 cores that's
only 512 bytes, i.e. 64 doubles, per core.

(Unless you scale the block size by ncores.)
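
For the record, the "five-line blocked engine" I have in mind is roughly
the hypothetical blocked_exp below. It's only a sketch: I'm assuming
fmath's expd_v has the in-place signature void expd_v(double *px,
size_t n), as it appears in the header linked above; adjust to whatever
your version actually exposes.

#include <algorithm>
#include <cstddef>
#include "fmath.hpp"  // https://github.com/herumi/fmath

// Apply exp() in place to data[0..n) in 4 KB chunks. There is a single
// OpenMP parallel section for the whole array, so the thread pool is
// entered only once, and with a static schedule each thread's chunks
// tend to stay in that thread's L1 cache.
void blocked_exp(double *data, std::size_t n)
{
    const std::size_t block = 4096 / sizeof(double);  // 512 doubles
    const std::size_t nblocks = (n + block - 1) / block;
    #pragma omp parallel for schedule(static)
    for (std::size_t b = 0; b < nblocks; ++b) {
        const std::size_t start = b * block;
        const std::size_t len = std::min(block, n - start);
        fmath::expd_v(data + start, len);  // assumed in-place variant
    }
}

Compile with -fopenmp. The point is exactly a) above: one parallel
section for the whole array instead of one per 4 KB block.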

DS


