[Numpy-discussion] testing with amd libm/acml

Francesc Alted francesc at continuum.io
Thu Nov 8 14:37:30 EST 2012


On 11/8/12 7:55 PM, Dag Sverre Seljebotn wrote:
> On 11/08/2012 06:59 PM, Francesc Alted wrote:
>> On 11/8/12 6:38 PM, Dag Sverre Seljebotn wrote:
>>> On 11/08/2012 06:06 PM, Francesc Alted wrote:
>>>> On 11/8/12 1:41 PM, Dag Sverre Seljebotn wrote:
>>>>> On 11/07/2012 08:41 PM, Neal Becker wrote:
>>>>>> Would you expect numexpr without MKL to give a significant boost?
>>>>> If you need higher performance than what numexpr can give without using
>>>>> MKL, you could look at code such as this:
>>>>>
>>>>> https://github.com/herumi/fmath/blob/master/fmath.hpp#L480
>>>> Hey, that's cool.  I was a bit disappointed not to have found this sort
>>>> of work out in the open before.  It seems that this lacks threading
>>>> support, but that should be easy to add using OpenMP directives.
>>> IMO this is the wrong place to introduce threading; each thread should
>>> call expd_v on its chunks. (Which I think is how you said numexpr
>>> currently uses VML anyway.)
>> Oh sure, but then you need a blocked engine for performing the
>> computations too.  And yes, by default numexpr uses its own threading
> I just meant that you can use a chunked OpenMP for-loop wherever in your
> code that you call expd_v. A "five-line blocked engine", if you like :-)
>
> IMO that's the right location since entering/exiting OpenMP blocks takes
> some time.

Yes, this is precisely what I had in mind.
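For the archives, here is roughly the shape of that "five-line blocked 
engine" (a minimal sketch under some assumptions: it presumes that 
fmath::expd_v(double*, size_t) applies exp() in place, and the chunk 
size is just a placeholder that would need tuning):

#include <cstddef>
#include "fmath.hpp"

// Apply exp() in place over a large array.  There is a single OpenMP
// parallel region for the whole loop, so threads are started/signalled
// once per array rather than once per chunk.
void exp_inplace(double *x, long n)
{
    const long chunk = 4096;  // elements per expd_v call; needs tuning
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; i += chunk) {
        const long len = (n - i < chunk) ? (n - i) : chunk;
        fmath::expd_v(x + i, (std::size_t)len);  // vectorized exp on chunk
    }
}

With schedule(static) each thread also keeps working on the same chunks 
from one call to the next, which helps keep a given chunk in a single 
core's cache.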

>> code rather than the existing one in VML (but that can be changed by
>> playing with set_num_threads/set_vml_num_threads).  It always struck
>> me as a little strange that the internal threading in numexpr was more
>> efficient than the VML one, but I suppose this is because the latter
>> is better optimized for large blocks than for the medium-sized (4K)
>> ones that numexpr uses.
> I don't know enough about numexpr to understand this :-)
>
> I guess I just don't see the motivation to use VML threading or why it
> should be faster? If you pass a single 4K block to a threaded VML call
> then I could easily see lots of performance problems: a)
> starting/stopping threads or signalling the threads of a pool is a
> constant overhead per "parallel section", b) unless you're very careful
> to have only VML touch the data, and unless VML always schedules
> elements in the exact same way, you're going to have the cache lines of
> that 4K block shuffled between the L1 caches of different cores for
> different operations...
>
> As I said, I'm mostly ignorant about how numexpr works, that's probably
> showing :-)

No, on the contrary, you rather hit the core of the issue (or part of 
it).  On the one hand, VML needs large blocks in order to maximize 
pipeline performance, while on the other hand numexpr tries to minimize 
the block size in order to keep temporaries as small as possible (thus 
avoiding the use of the higher-level caches).  From this tension (and 
some benchmarking work) the size of 4K was derived (btw, this is the 
number of *elements*, so the size is actually either 16 KB or 32 KB for 
single and double precision respectively).  Incidentally, for numexpr 
with no VML support, the size is reduced to 1K elements (and perhaps it 
could be reduced a bit more, but anyway).
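To make that tension concrete, here is a toy sketch (in the spirit of 
numexpr's blocked evaluation, but *not* its actual code) of computing 
something like "exp(a) * b" blockwise, so that the temporary fits in 
cache:

#include <cmath>
#include <cstddef>

const std::size_t BLOCK = 4096;  // elements: 32 KB of doubles

// Toy blocked evaluation of out = exp(a) * b.  The temporary 'tmp'
// holds only BLOCK elements, so it stays cache-resident across the
// two passes instead of streaming an n-sized temporary through memory.
void eval_exp_mul(const double *a, const double *b,
                  double *out, std::size_t n)
{
    double tmp[BLOCK];
    for (std::size_t i = 0; i < n; i += BLOCK) {
        const std::size_t len = (n - i < BLOCK) ? (n - i) : BLOCK;
        for (std::size_t j = 0; j < len; ++j)   // op 1: tmp = exp(a)
            tmp[j] = std::exp(a[i + j]);
        for (std::size_t j = 0; j < len; ++j)   // op 2: out = tmp * b
            out[i + j] = tmp[j] * b[i + j];
    }
}

Growing BLOCK would let a library like VML work on bigger pieces (good 
for its pipelines), but the temporaries would start spilling out of the 
L1/L2 caches; the 4K figure is where the benchmarks balanced the two.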

Anyway, this is way too low level to be discussed here, although we can 
continue on the numexpr list if you are interested in more details.

-- 
Francesc Alted



