Multiprocessing, shared memory vs. pickled copies

sturlamolden sturlamolden at yahoo.no
Sat Apr 9 13:15:49 EDT 2011


On 9 apr, 09:36, John Ladasky <lada... at my-deja.com> wrote:

> Thanks for finding my discussion!  Yes, it's about passing numpy
> arrays to multiple processors.  I'll accomplish that any way that I
> can.

My preferred ways of doing this are:

1. Most cases for parallel processing are covered by libraries, even
for neural nets. This particularly involves linear algebra solvers and
FFTs, or calling certain expensive functions (sin, cos, exp) over and
over again. The solution here is optimised LAPACK and BLAS (Intel MKL,
AMD ACML, GotoBLAS, ATLAS, Cray libsci), optimised FFTs (FFTW, Intel
MKL, ACML), and fast vector math libraries (Intel VML, ACML). For
example, if you want to make multiple calls to the function "exp",
there is a good chance you want to use a vector math library. Despite
this, most Python programmers' instinct seems to be to use multiple
processes with numpy.exp or math.exp, or to use multiple threads in C
with exp from libm (cf. math.h). Why go through this pain when a
single function call to Intel VML or AMD ACML (acml-vm) will be much
better? It is common to see scholars argue that "yes but my needs are
so special that I need to customise everything myself." Usually this
translates to "I don't know these libraries (nor even that they exist)
and am happy to reinvent the wheel."  Thus, if you think you need to
use manually managed threads or processes for parallel technical
computing, and even contemplate that the GIL might get in your way,
there is a 99% chance you are wrong. You will almost ALWAYS want to
use a fast library, either directly in Python or linked to your own
serial C or Fortran code. You have probably heard that "premature
optimisation is the root of all evil in computer programming." It
particularly applies here. Learn to use the available performance
libraries, and it does not matter from which language they are called
(C or Fortran will not be faster than Python!) This is one of the
major reasons Python can be used for HPC (high-performance computing)
even though the Python part itself is "slow". Most of these libraries
are available for free (GotoBLAS, ATLAS, FFTW, ACML), but Intel MKL
and VML require a license fee.
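To make the "exp" example above concrete, here is a minimal sketch in
plain NumPy (standing in for a tuned vector math library such as VML
or acml-vm): one vectorized library call replaces thousands of
interpreted calls to math.exp, and with an MKL/VML-linked build that
single call can also run multi-threaded for you.

```python
import math
import numpy as np

x = np.linspace(0.0, 1.0, 100000)

# Scalar loop: one interpreted Python call to math.exp per element.
slow = np.array([math.exp(v) for v in x])

# Vectorized: one call into compiled code computes every element
# (an MKL/VML-backed build can add SIMD and multiple cores on top).
fast = np.exp(x)

assert np.allclose(slow, fast)
```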

Also note that there are comprehensive numerical libraries you can
use, which can be linked with multi-threaded performance libraries
under the hood. Of particular interest are the Fortran libraries from
NAG and IMSL, which are the two gold standards of technical computing.
Also note that the linear algebra solvers of NumPy and SciPy in
Enthought Python Distribution are linked with Intel MKL. Enthought's
license is cheaper than Intel's, and you don't need to build NumPy or
SciPy against MKL yourself. Using scipy.linalg from EPD is likely to
cover your need for parallel computing with neural nets.
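The point is that the solver, not your Python code, does the parallel
work. A sketch using numpy.linalg as a stand-in (scipy.linalg exposes
the same-named solve, and in an MKL-linked build the one LAPACK call
below runs multi-threaded with no Python-level parallelism at all):

```python
import numpy as np

rng = np.random.RandomState(0)
n = 500
# A diagonally dominant (hence well-conditioned) random system.
A = rng.rand(n, n) + n * np.eye(n)
b = rng.rand(n)

# Single call into LAPACK; threading, blocking and SIMD are the
# library's business, not yours.
x = np.linalg.solve(A, b)

assert np.allclose(A.dot(x), b)
```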


2. Use Cython, threading.Thread, and release the GIL. Perhaps we
should have a cookbook example in scipy.org on this. In the "nogil"
block you can call a library or do certain things that Cython allows
without holding the GIL.
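You can see the same pattern from pure Python whenever the heavy
lifting happens in compiled code that drops the GIL. A sketch
(assuming a NumPy build whose BLAS-backed np.dot releases the GIL
around the library call, which standard builds do):

```python
import threading
import numpy as np

def worker(a, b, out, i):
    # np.dot on large float arrays goes through BLAS; the GIL is
    # released for the duration of the call, so these threads can
    # genuinely run in parallel.
    out[i] = np.dot(a, b)

n = 300
a = np.ones((n, n))
b = np.ones((n, n))
out = [None] * 4
threads = [threading.Thread(target=worker, args=(a, b, out, i))
           for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Each element of ones(n,n).dot(ones(n,n)) is n.
assert all(np.allclose(r, n) for r in out)
```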


3. Use C, C++ or Fortran with OpenMP, and call these using Cython,
ctypes or f2py. (I prefer Cython and Fortran 95, but any combination
will do.)
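The ctypes side of point 3 is tiny. A sketch that loads the C math
library and calls its exp, as a stand-in for your own compiled
(possibly OpenMP-parallel) C or Fortran library; the library name
resolution is platform-dependent:

```python
import ctypes
import ctypes.util
import math

# find_library resolves the platform-specific name of the C math
# library; for your own code you would pass the path to your .so.
libm = ctypes.CDLL(ctypes.util.find_library("m"))

# Always declare argument and return types, or ctypes will assume
# int and silently corrupt your doubles.
libm.exp.restype = ctypes.c_double
libm.exp.argtypes = [ctypes.c_double]

assert abs(libm.exp(1.0) - math.e) < 1e-12
```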


4. Use mpi4py with any MPI implementation.


Note that 1-4 are not mutually exclusive; you can always combine
them.


> I will retain a copy of YOUR shmarray code (not the Bitbucket code)
> for some time in the future.  I anticipate that my arrays might get
> really large, and then copying them might not be practical in terms of
> time and memory usage.

The expensive overhead in passing a NumPy array to
multiprocessing.Queue is related to pickle/cPickle, not IPC or making
a copy of the buffer.

For any NumPy array you can afford to copy in terms of memory, just
work with copies.
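To see what pickling an array actually does (this is what
multiprocessing.Queue does per array under the hood): the buffer is
serialised and unpickling allocates a fresh one, so parent and child
end up with independent copies.

```python
import pickle
import numpy as np

a = np.arange(5, dtype=np.float64)

# Round-trip through pickle, as multiprocessing.Queue would do.
b = pickle.loads(pickle.dumps(a))

b[0] = 99.0          # mutate the copy ...
assert a[0] == 0.0   # ... the original buffer is untouched
```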

The shared memory arrays I made are only useful for large arrays. They
are just as expensive to pickle in terms of time, but can be
inexpensive in terms of memory.

Also beware that the shared-memory buffer is not copied into the
pickle, so you need to call .copy() if you want to pickle the
contents of the buffer.

But again, I'd urge you to consider a library or threads
(threading.Thread in Cython or OpenMP) before you consider multiple
processes. The reason I have not updated the sharedmem arrays for two
years is that I have come to the conclusion that there are better ways
to do this (particularly vendor-tuned libraries). But since they are
mostly useful with 64-bit (i.e. large arrays), I'll post an update
soon.

If you decide to use a multithreaded solution (or shared memory as
IPC), beware of "false sharing". If multiple processors write to the
same cache line (typically 64 to 128 bytes, depending on hardware),
you'll create an invisible "GIL" that will kill any scalability. That
is because dirty cache lines must be synchronized with RAM. "False
sharing" is one of the major reasons that "home-brewed" compute-
intensive code will not scale.
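The standard cure is padding: give each thread its own cache line to
write to. A sketch of the layout idea (the GIL hides the performance
effect in pure Python, but the same padding carries straight over to
nogil Cython or OpenMP code; the 64-byte line size is a typical
assumption, not a universal constant):

```python
import threading
import numpy as np

NTHREADS = 4
CACHE_LINE = 64                 # bytes; typical, hardware-dependent
PAD = CACHE_LINE // 8           # float64 slots per cache line

# Padded accumulators: thread i writes only to sums[i * PAD], so no
# two threads ever write to the same cache line.
sums = np.zeros(NTHREADS * PAD)

def worker(i, data):
    s = 0.0
    for v in data:              # accumulate locally ...
        s += v
    sums[i * PAD] = s           # ... write once to the padded slot

data = np.ones(1000)
threads = [threading.Thread(target=worker, args=(i, data))
           for i in range(NTHREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

assert all(sums[i * PAD] == 1000.0 for i in range(NTHREADS))
```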

It is not uncommon to see Java programmers complain about Python's
GIL, and then go on to write I/O-bound or falsely shared code. Rest
assured that multi-threaded Java will not scale better than Python in
these cases :-)


Regards,
Sturla
