[Numpy-discussion] Numpy and OpenMP

Damian Eads eads at soe.ucsc.edu
Sat Mar 15 21:47:35 EDT 2008


I am forwarding a response from one of my colleagues, Edward Rosten.

Edward Rosten writes:
Anne Archibald wrote:
 > On 15/03/2008, Damian Eads <eads at soe.ucsc.edu> wrote:
 > > Robert Kern wrote:
 > >  > Eric Jones tried to use multithreading to split the computation
 > >  > of ufuncs across CPUs. Ultimately, the overhead of locking and
 > >  > unlocking made it prohibitive for medium-sized arrays and only
 > >  > somewhat disappointing improvements in performance for quite
 > >  > large arrays. I'm not familiar enough with OpenMP to determine if
 > >  > this result would be applicable to it. If you would like to try,
 > >  > we can certainly give you pointers as to where to start.
 > >
 > > Perhaps I'm missing something. How is locking and synchronization an
 > > issue when each thread is writing to a mutually exclusive part of
 > > the output buffer?
 >
 > The trick is to efficiently allocate these output buffers. If you
 > simply give each thread 1/n th of the job, if one CPU is otherwise
 > occupied it doubles your computation time.

I do not see how this is the case. There will be a small amount of time
spent switching between threads at the OS level if the OS has to run
more threads than it has CPUs. However, in my experience, using far more
threads than CPUs has little impact on performance.

 > If you break the job into
 > many pieces and let threads grab them, you need to worry about locking
 > to keep two threads from grabbing the same piece of data.

If you split the job into as many pieces as there are threads, hand one
piece to each thread, and then run the threads, there is no problem with
threads grabbing the same data. If you split the job up into many more
pieces than threads, then you have to deal with handing out pieces to
threads whenever they finish. The synchronization for this is not
difficult:

I have used the architecture where you have a "global" message queue,
which has mutexes around all operations. All jobs are put into the
queue and all threads try to extract jobs from the queue. Attempting to
read from an empty queue blocks. In C++ with POSIX thread primitives,
the code for the message queue looks like this (Synchronized below is a
thin wrapper around a POSIX mutex):


#include <pthread.h>
#include <semaphore.h>
#include <deque>

// A minimal mutex wrapper providing the lock()/unlock() interface used below.
class Synchronized
{
    public:
        Synchronized()  { pthread_mutex_init(&mutex, NULL); }
        ~Synchronized() { pthread_mutex_destroy(&mutex); }

        void lock()   { pthread_mutex_lock(&mutex); }
        void unlock() { pthread_mutex_unlock(&mutex); }

    private:
        pthread_mutex_t mutex;
};

template<class C> class MessageQueue
{
    public:
        MessageQueue()
        {
            // The semaphore counts messages waiting in the queue, so it
            // starts at zero and read() blocks until write() posts it.
            sem_init(&available, 0, 0);
        }

        ~MessageQueue()
        {
            sem_destroy(&available);
        }

        void write(const C& message)
        {
            // Lock the queue so it can be safely modified.
            queue_mutex.lock();
            queue.push_back(message);
            queue_mutex.unlock();

            // Signal that one more message is available.
            sem_post(&available);
        }

        C read()
        {
            // Block until at least one message is available.
            sem_wait(&available);

            queue_mutex.lock();
            C ret = queue.front();
            queue.pop_front();
            queue_mutex.unlock();

            return ret;
        }

    private:
        Synchronized queue_mutex;
        std::deque<C> queue;
        sem_t available;
};

C is typically a class which derives from:

struct Runnable
{
    virtual void run() = 0;
    virtual ~Runnable() {}
};

This simple abstract class allows the threads reading from the queue to
know how to execute the code associated with the message.
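As a rough sketch (assumed here for illustration, not part of the
original design), a worker thread's main loop built on the queue above
might look like:

// Illustrative worker loop: each thread blocks on the shared queue,
// pulls off a job, and runs it. A real implementation would also need
// a way to shut workers down, e.g. by posting a sentinel job.
void* worker(void* arg)
{
    MessageQueue<Runnable*>* jobs = static_cast<MessageQueue<Runnable*>*>(arg);

    for(;;)
    {
        Runnable* job = jobs->read();   // blocks until a job is available
        job->run();
        delete job;
    }

    return NULL;
}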

However, in this case, there is no real need for this machinery. If you
have a pool of N threads, you can simply split the job into N chunks and
hand one chunk to each thread, roughly as in the sketch below.
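
As a sketch of that simpler scheme (the element-wise operation and the
chunk bookkeeping are illustrative assumptions, not numpy internals):

#include <pthread.h>
#include <cstddef>
#include <vector>

// One contiguous chunk of an element-wise job: compute y[i] = 2*x[i]
// for i in [begin, end). Each thread gets its own chunk, so no locking
// is needed while the threads run.
struct Chunk
{
    const double* x;
    double* y;
    std::size_t begin, end;
};

void* run_chunk(void* arg)
{
    Chunk* c = static_cast<Chunk*>(arg);
    for(std::size_t i = c->begin; i != c->end; i++)
        c->y[i] = 2 * c->x[i];
    return NULL;
}

// Split [0, n) into nthreads contiguous pieces, run one thread per
// piece, then join them all (fork/join by hand).
void parallel_double(const double* x, double* y, std::size_t n, int nthreads)
{
    std::vector<pthread_t> threads(nthreads);
    std::vector<Chunk> chunks(nthreads);

    for(int t = 0; t < nthreads; t++)
    {
        chunks[t].x = x;
        chunks[t].y = y;
        chunks[t].begin = n * t / nthreads;
        chunks[t].end   = n * (t + 1) / nthreads;
        pthread_create(&threads[t], NULL, run_chunk, &chunks[t]);
    }

    for(int t = 0; t < nthreads; t++)
        pthread_join(threads[t], NULL);
}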


 > Plus,
 > depending on where things are in memory you can kill performance by
 > abusing the caches (maintaining cache consistency across CPUs can be a
 > challenge).




 > Plus a certain amount of numpy code depends on order of
 > evaluation:
 >
 > a[:-1] = 2*a[1:]
 >
 > Correctly handling all this can take a lot of overhead, and require a
 > lot of knowledge about hardware.

The easiest solution to the case above is simply not to optimize those
cases. This is the approach C compilers take. If the aliasing can be
detected, then choosing not to optimize can be handled automatically. If
it cannot be detected automatically, then you have to make the user
promise not to abuse aliasing.
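
One common way to make that promise explicit is C99's restrict
qualifier (most C++ compilers accept __restrict__ as an extension). The
example below is only an illustration of the idea, not numpy code:

// The caller promises that out and in do not overlap, so the loop
// iterations are independent and can safely be reordered or run in
// parallel.
void scale(double* __restrict__ out, const double* __restrict__ in, int n)
{
    for(int i = 0; i < n; i++)
        out[i] = 2 * in[i];
}

// Calling scale(a, a + 1, n - 1) would break that promise: it is exactly
// the a[:-1] = 2*a[1:] case above, where the order of evaluation matters.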

 > OpenMP tries to take care of some of
 > this in a way that's easy on the programmer.

IIRC OpenMP uses essentially the fork/join structure, where a task is run
in parallel, the program waits for all threads to finish, and then
continues. I've used this style (with pthreads) and it makes
synchronization hard to mess up. However, you also have to promise the
compiler that your arrays don't alias in nasty ways.
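
For reference, a minimal fork/join loop in OpenMP looks roughly like
the following (an illustration, not numpy's actual inner loop; compiled
with a flag such as -fopenmp):

#include <cstddef>

// The pragma forks a team of threads, each thread runs a share of the
// iterations, and execution joins again at the end of the loop.
void scale_omp(double* out, const double* in, std::size_t n)
{
    #pragma omp parallel for
    for(std::ptrdiff_t i = 0; i < (std::ptrdiff_t)n; i++)
        out[i] = 2 * in[i];
}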

-Ed


 > To answer the OP's question, there is a relatively small number of C
 > inner loops that could be marked up with OpenMP #pragmas to cover most
 > matrix operations. Matrix linear algebra is a separate question, since
 > numpy/scipy prefers to use optimized third-party libraries - in these
 > cases one would need to use parallel linear algebra libraries (which
 > do exist, I think, and are plug-compatible). So parallelizing numpy is
 > probably feasible, and probably not too difficult, and would be
 > valuable. The biggest catch, I think, would be compilation issues - is
 > it possible to link an OpenMP-compiled shared library into a normal
 > executable?
 >
 > Anne


-- 

(You can't go wrong with psycho-rats.)(http://mi.eng.cam.ac.uk/~er258)



