[Python-ideas] solving multi-core Python

Wed Jun 24 18:28:54 CEST 2015

On 24/06/15 07:01, Eric Snow wrote:

> Well, perception is 9/10ths of the law. :)  If the multi-core problem
> is already solved in Python then why does it fail in the court of
> public opinion?  The perception that Python lacks a good multi-core
> story is real, leads organizations away from Python, and will not
> improve without concrete changes.

I think it is a combination of FUD and the lack of fork() on Windows. 
There is a lot of utterly wrong information about CPython and its GIL.

The reality is that Python is used on even the largest supercomputers. 
The scalability problem that is seen on those systems is not the GIL, 
but the module import. If we have 1000 CPython processes importing 
modules like NumPy simultaneously, they will do a "denial of service 
attack" on the file system. This happens when the module importer 
generates a huge number of failed open() calls while trying to locate 
the module files.
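
To give a feel for the numbers, here is a rough sketch that counts the
file names a naive path-based lookup could probe for one import. The
function candidate_paths() and the 1000-process figure are made up for
illustration. This is not a trace of what the real importer does (recent
CPython versions cache directory listings, for instance). It only shows
how the lookups multiply across search-path entries, suffixes, and
processes.

```python
import importlib.machinery
import os
import sys


def candidate_paths(module_name, search_path):
    """List file names a naive path-based lookup might probe (hypothetical helper)."""
    suffixes = importlib.machinery.all_suffixes()
    candidates = []
    for directory in search_path:
        for suffix in suffixes:
            # Plain module file, e.g. <dir>/numpy.py or <dir>/numpy.<abi>.so
            candidates.append(os.path.join(directory, module_name + suffix))
            # Package form, e.g. <dir>/numpy/__init__.py
            candidates.append(
                os.path.join(directory, module_name, "__init__" + suffix)
            )
    return candidates


if __name__ == "__main__":
    probes = candidate_paths("numpy", sys.path)
    processes = 1000  # assumed job size, matching the figure in the text
    print(len(probes), "candidate paths per process for one import")
    print(len(probes) * processes, "lookups if every process probes every path")
```

Most of those candidates do not exist, so on a shared file system the
bulk of the traffic is failed lookups.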

This is even described in a paper on how to avoid it on an IBM Blue 
Gene: "As an example, on Blue Gene P just starting up Python and 
importing NumPy and GPAW with 32768 MPI tasks can take 45 minutes!"

http://www.cs.uoregon.edu/research/paracomp/papers/iccs11/iccs_paper_final.pdf

And while CPython is being used for massively parallel computing to e.g. 
model the global climate system, there is this FUD that CPython does not 
even scale up on a laptop with a single multicore CPU. I don't know 
where it is coming from, but it is more FUD than truth.

The main answers to FUD about the GIL and Python in scientific computing 
are these:

1. Python in itself generates a 200x to 2000x performance hit compared 
to C or Fortran. Do not write compute kernels in Python, unless you can 
compile with Cython or Numba. If you have need for speed, start by 
moving the performance critical parts to Cython instead of optimizing 
for a few CPU cores.

2. If you can release the GIL, e.g. in Cython code, Python threads scale 
like any other native OS thread. They are real threads, not fake threads 
in the interpreter.

3. The 80-20, 90-10, or 99-1 rule: The majority of the code accounts for 
a small portion of the runtime. It is wasteful to optimize "everything". 
The more speed you need, the stronger this asymmetry will be. Identify 
the bottlenecks with a profiler and optimize those.

4. Using C or Java does not give you a faster hard drive or a faster 
network connection. You cannot improve on network access by using 
threads in C or Java instead of threads in Python. If your code is I/O 
bound, Python's GIL does not matter. Python threads do execute I/O tasks 
in parallel. (This is the major misunderstanding.)

5. Computationally intensive parts of a program are usually taken care 
of in libraries like BLAS, LAPACK, and FFTW. The Fortran code in LAPACK 
does not care if you called it from Python. It will be as fast as it can 
be, independent of Python. The Fortran code in LAPACK also has no 
concept of Python's GIL. LAPACK libraries like Intel MKL can use threads 
internally without asking Python for permission.

6. The scalability problem when using Python on a massive supercomputer 
is not the GIL but the module import.

7. When using OpenCL we write kernels as plain text. Python is excellent 
at manipulating text, more so than C. This also applies to using OpenGL 
for computer graphics with GLSL shaders and vertex buffer objects. If you 
need the GPU, you can just as well use Python on the CPU.
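
Point 4 is easy to check with a small experiment. The sketch below uses
time.sleep() as a stand-in for a blocking read or network call, since it
releases the GIL while waiting, much as blocking I/O calls in the
standard library do. The helper names (fake_io, run_serial,
run_threaded, timed) and the delay values are assumptions chosen for
illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_io(delay):
    """Stand-in for a blocking I/O call; sleep releases the GIL while waiting."""
    time.sleep(delay)
    return delay


def run_serial(n, delay):
    """Perform n simulated I/O calls one after another."""
    for _ in range(n):
        fake_io(delay)


def run_threaded(n, delay):
    """Perform n simulated I/O calls at the same time, one thread each."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        list(pool.map(fake_io, [delay] * n))


def timed(func):
    """Return the wall-clock seconds taken by func()."""
    start = time.perf_counter()
    func()
    return time.perf_counter() - start


if __name__ == "__main__":
    n, delay = 8, 0.25  # assumed workload: 8 calls of 250 ms each
    print("serial:   %.2f s" % timed(lambda: run_serial(n, delay)))
    print("threaded: %.2f s" % timed(lambda: run_threaded(n, delay)))
```

The threaded run should take roughly one delay rather than n of them,
because the threads wait concurrently instead of queuing behind the GIL.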

Sturla