[Python-ideas] solving multi-core Python
sturla.molden at gmail.com
Wed Jun 24 18:28:54 CEST 2015
On 24/06/15 07:01, Eric Snow wrote:
> Well, perception is 9/10ths of the law. :) If the multi-core problem
> is already solved in Python then why does it fail in the court of
> public opinion. The perception that Python lacks a good multi-core
> story is real, leads organizations away from Python, and will not
> improve without concrete changes.
I think it is a combination of FUD and the lack of fork() on Windows.
There is a lot of utterly wrong information about CPython and its GIL.
The reality is that Python is used on even the largest supercomputers.
The scalability problem that is seen on those systems is not the GIL,
but the module import. If we have 1000 CPython processes importing
modules like NumPy simultaneously, they will do a "denial of service
attack" on the file system. This happens when the module importer
generates a huge number of failed open() calls while trying to locate
the module files.
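As a rough sketch of why this hurts at scale, here is a simplified model of the probing the import machinery does (the real finder checks many more suffixes, plus __pycache__ and package variants, so this undercounts):

```python
import os
import sys

def candidate_probes(module_name, search_path):
    """Enumerate file names the importer would probe for a top-level
    module. Most of these open()/stat() calls fail, and with thousands
    of processes they hammer a shared file system simultaneously."""
    suffixes = [".py", ".pyc", ".so"]  # simplified; real list is platform-specific
    probes = []
    for directory in search_path:
        # package form: directory/module_name/__init__.py
        probes.append(os.path.join(directory, module_name, "__init__.py"))
        # plain module forms: directory/module_name.<suffix>
        for suffix in suffixes:
            probes.append(os.path.join(directory, module_name + suffix))
    return probes

probes = candidate_probes("numpy", sys.path)
print(len(probes), "file system probes for a single import")
```

Multiply that by the number of imports a package like NumPy triggers, and again by 32768 MPI ranks, and the 45-minute startup time quoted above stops being surprising.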
This is even described in a paper on how to avoid it on an IBM Blue
Gene: "As an example, on Blue Gene P just starting up Python and
importing NumPy and GPAW with 32768 MPI tasks can take 45 minutes!"

And while CPython is being used for massive parallel computing to e.g.
model the global climate system, there is this FUD that CPython does not
even scale up on a laptop with a single multicore CPU. I don't know
where it is coming from, but it is more FUD than truth.
The main answers to FUD about the GIL and Python in scientific computing
are:
1. Pure Python code can incur a 200x to 2000x performance hit compared
to C or Fortran. Do not write compute kernels in Python unless you can
compile them with Cython or Numba. If you need speed, start by moving
the performance-critical parts to Cython instead of optimizing for a
few CPU cores.
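The size of that interpreter overhead is easy to see even without Cython: the same loop run as Python bytecode versus run inside a C-implemented builtin. This is only an illustration of the gap, not a Cython benchmark:

```python
import time

def py_sum(n):
    # every iteration goes through the bytecode interpreter
    total = 0
    for i in range(n):
        total += i
    return total

def c_sum(n):
    # sum() drives the same loop in C, like a compiled kernel would
    return sum(range(n))

n = 1_000_000
t0 = time.perf_counter()
py_sum(n)
t1 = time.perf_counter()
c_sum(n)
t2 = time.perf_counter()
print(f"pure Python: {t1 - t0:.3f}s, C-level loop: {t2 - t1:.3f}s")
```

A Cython or Numba version of a real kernel widens this gap further, because it also removes boxing and dynamic dispatch inside the loop body.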
2. If you can release the GIL, e.g. in Cython code, Python threads scale
like any other native OS thread. They are real threads, not fake threads
in the interpreter.
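You do not even need Cython to see this: some C-implemented parts of the standard library already release the GIL. For example, CPython's hashlib releases the GIL while hashing large buffers, so plain Python threads can hash on separate cores at the same time:

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

# four distinct 4 MiB buffers
data = [bytes([i]) * (1 << 22) for i in range(4)]

def digest(buf):
    # CPython releases the GIL inside the C hashing code for large
    # buffers, so these threads genuinely run in parallel
    return hashlib.sha256(buf).hexdigest()

with ThreadPoolExecutor(max_workers=4) as pool:
    digests = list(pool.map(digest, data))

print(digests[0][:16], "...")
```

The same pattern applies to your own Cython code: wrap the hot loop in "with nogil:" and ordinary threading.Thread objects scale across cores.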
3. The 80-20, 90-10, or 99-1 rule: The majority of the code accounts for
a small portion of the runtime. It is wasteful to optimize "everything".
The more speed you need, the stronger this asymmetry will be. Identify
the bottlenecks with a profiler and optimize those.
4. Using C or Java does not give you a faster hard drive or a faster
network connection. You cannot improve on network access by using
threads in C or Java instead of threads in Python. If your code is I/O
bound, Python's GIL does not matter: Python threads do execute I/O tasks
in parallel. (This is the major misunderstanding.)
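This is easy to demonstrate, since CPython releases the GIL around blocking calls. Here time.sleep stands in for a blocking read() or recv():

```python
import threading
import time

def blocking_io():
    # time.sleep releases the GIL, just like a blocking read() would
    time.sleep(0.2)

start = time.perf_counter()
threads = [threading.Thread(target=blocking_io) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.perf_counter() - start

# ten 0.2 s blocking calls overlap instead of running back to back
print(f"10 blocking calls finished in {elapsed:.2f}s")
```

If the GIL serialized I/O, the ten calls would take about two seconds; in practice they finish in roughly the time of one.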
5. Computationally intensive parts of a program are usually taken care
of by libraries like BLAS, LAPACK, and FFTW. The Fortran code in LAPACK
does not care if you called it from Python. It will be as fast as it can
be, independent of Python. The Fortran code in LAPACK also has no
concept of Python's GIL. LAPACK libraries like Intel MKL can use threads
internally without asking Python for permission.
6. The scalability problem when using Python on a massive supercomputer
is not the GIL but the module import.
7. When using OpenCL we write kernels as plain text. Python is excellent
at manipulating text, more so than C. This also applies to using OpenGL
for computer graphics with GLSL shaders and vertex buffer objects. If
the GPU is doing the heavy lifting, you can just as well use Python on
the CPU.
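A small sketch of what "Python as kernel generator" looks like; the kernel name and the idea of templating over the scalar type are illustrative, not from any particular library:

```python
def make_axpy_kernel(scalar_type="float", name="saxpy"):
    """Generate OpenCL C source for y = a*x + y, parameterized over the
    scalar type -- the kind of string templating that is painful in C
    but trivial in Python."""
    return f"""
__kernel void {name}(const {scalar_type} a,
                     __global const {scalar_type} *x,
                     __global {scalar_type} *y)
{{
    int i = get_global_id(0);
    y[i] = a * x[i] + y[i];
}}
"""

# single- and double-precision variants from the same template
print(make_axpy_kernel())
print(make_axpy_kernel(scalar_type="double", name="daxpy"))
```

The generated source string is what you would hand to the OpenCL runtime to compile for the device; the host-side glue stays in Python.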