[Python-ideas] solving multi-core Python

Nick Coghlan ncoghlan at gmail.com
Thu Jun 25 16:08:07 CEST 2015


On 25 June 2015 at 02:28, Sturla Molden <sturla.molden at gmail.com> wrote:
> On 24/06/15 07:01, Eric Snow wrote:
>
>> Well, perception is 9/10ths of the law. :)  If the multi-core problem
>> is already solved in Python then why does it fail in the court of
>> public opinion?  The perception that Python lacks a good multi-core
>> story is real, leads organizations away from Python, and will not
>> improve without concrete changes.
>
>
> I think it is a combination of FUD and the lack of fork() on Windows. There
> is a lot of utterly wrong information about CPython and its GIL.
>
> The reality is that Python is used on even the largest supercomputers. The
> scalability problem that is seen on those systems is not the GIL, but the
> module import. If we have 1000 CPython processes importing modules like
> NumPy simultaneously, they will do a "denial of service attack" on the file
> system. This happens when the module importer generates a huge number of
> failed open() calls while trying to locate the module files.

Slight tangent, but folks hitting this issue on 2.7 may want to
investigate Eric's importlib2: https://pypi.python.org/pypi/importlib2

It switches from stat-based searching for files to the Python 3.3+
model of directory-listing-based searches, which can (anecdotally)
yield a couple of orders of magnitude of improvement in startup time
for code that loads modules from NFS mounts.
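To make the difference concrete, here's a minimal sketch of the two
search strategies (this is illustrative, not CPython's actual finder
implementation): probing each candidate filename with a stat call
versus listing the directory once and answering lookups from memory.

```python
# Illustrative sketch only, not CPython's real import machinery:
# contrast per-candidate stat() probes with a cached directory listing.
import os
import tempfile

def find_by_stat(directory, module):
    # Old-style search: one os.stat() probe per candidate suffix.
    for suffix in (".py", ".pyc", ".so"):
        candidate = os.path.join(directory, module + suffix)
        try:
            os.stat(candidate)  # each miss is a failed filesystem call
            return candidate
        except FileNotFoundError:
            pass
    return None

def find_by_listing(directory, module, _cache={}):
    # 3.3+-style search: list the directory once, then check in memory.
    if directory not in _cache:
        _cache[directory] = set(os.listdir(directory))  # one syscall, cached
    for suffix in (".py", ".pyc", ".so"):
        name = module + suffix
        if name in _cache[directory]:
            return os.path.join(directory, name)
    return None

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        open(os.path.join(d, "spam.py"), "w").close()
        assert find_by_stat(d, "spam") == find_by_listing(d, "spam")
        assert find_by_stat(d, "eggs") is None
        assert find_by_listing(d, "eggs") is None
```

With 1000 processes sharing one NFS mount, the cached listing turns
thousands of failed open()/stat() calls per process into a handful of
directory reads.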

> And while CPython is being used for massive parallel computing to e.g. model
> the global climate system, there is this FUD that CPython does not even
> scale up on a laptop with a single multicore CPU. I don't know where it is
> coming from, but it is more FUD than truth.

Like a lot of things in the vast sprawling Python ecosystem, I think
there are aspects of this that are a discoverability problem more so
than a capability problem. When folks first experiment with parallel
execution, they often start with computational problems like
executing multiple factorials at once. That's trivial to do across
multiple cores even with a threading model like JavaScript's worker
threads, but can't be done in CPython without reaching for the
multiprocessing module. This is the one place where I'll concede that
folks learning to program on Windows or the JVM, and hence getting
the idea that "creating threads is fast, creating processes is slow",
causes problems: folks playing with this kind of thing are far more
likely to go "import threading" than "import multiprocessing" (and
likewise for ThreadPoolExecutor vs ProcessPoolExecutor when using
concurrent.futures), and their reaction when it doesn't work is far
more likely to be "Python can't do this" than "I need to do this
differently in Python from the way I do it in
C/C++/Java/JavaScript".
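A minimal sketch of that contrast, using concurrent.futures (the
workload and numbers here are just illustrative):

```python
# Illustrative sketch: identical calling code for both pools, but under
# the GIL only the process pool spreads this CPU-bound work across cores.
import math
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def cpu_bound(n):
    # A CPU-bound task: compute a large factorial, report its digit count.
    return len(str(math.factorial(n)))

def run_with(executor_cls, jobs):
    # Only the executor class differs between the two versions.
    with executor_cls(max_workers=4) as pool:
        return list(pool.map(cpu_bound, jobs))

if __name__ == "__main__":
    jobs = [2000] * 4
    # Same answers either way; in CPython, only the process pool
    # actually runs this workload on multiple cores at once.
    assert run_with(ThreadPoolExecutor, jobs) == run_with(ProcessPoolExecutor, jobs)
```

The trap is exactly that the two spellings are interchangeable at the
API level, so "import threading" looks like it should work, silently
doesn't speed anything up, and Python gets the blame.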

> The main answers to FUD about the GIL and Python in scientific computing are
> these:

It generally isn't scientific programmers that I personally hit
problems with (although we have to allow for the fact that many of
the scientists I know I met *because* they're Pythonistas). For that
use case, there's not only HPC to point to, but a number of papers
that talk about Cython and Numba in the same breath as C, C++ and
FORTRAN, which is pretty spectacular company to be in when it comes
to numerical computation. Being the fourth language Nvidia supported
directly for CUDA doesn't hurt either.

Instead, the folks that I think have a more valid complaint are the
games developers, and the folks trying to use games development as an
educational tool. They're not doing array based programming the way
numeric programmers are (so the speed of the NumPy stack isn't any
help), and they're operating on shared game state and frequently
chattering back and forth between threads of control, so
high-overhead message passing poses a major performance problem.
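To put a rough number on "high overhead", here's a sketch that times
cross-process round trips through multiprocessing queues (illustrative
only; it pins the POSIX "fork" start method to keep the sketch simple):

```python
# Illustrative sketch of why chatty designs hurt: every cross-process
# message pays for pickling plus an OS pipe transfer.
import multiprocessing as mp
import time

def echo_worker(inbox, outbox):
    # Echo messages back until a None sentinel arrives.
    for msg in iter(inbox.get, None):
        outbox.put(msg)

def seconds_per_round_trip(n):
    ctx = mp.get_context("fork")  # POSIX-only; avoids pickling the target
    inbox, outbox = ctx.Queue(), ctx.Queue()
    worker = ctx.Process(target=echo_worker, args=(inbox, outbox))
    worker.start()
    start = time.perf_counter()
    for i in range(n):
        inbox.put(i)              # serialise + write to a pipe
        assert outbox.get() == i  # read from a pipe + deserialise
    elapsed = time.perf_counter() - start
    inbox.put(None)
    worker.join()
    return elapsed / n

if __name__ == "__main__":
    # Fine for coarse-grained work; painful when game objects need to
    # chatter with each other on every frame.
    print("per round trip: %.1f us" % (seconds_per_round_trip(1000) * 1e6))
```

That per-message cost is negligible for handing out big chunks of
work, but it dominates when the messages are small and frequent, which
is exactly the shape of shared game state updates.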

That does suggest to me a possible "archetypal problem" for the work
Eric is looking to do here: a 2D canvas with multiple interacting
circles bouncing around. We'd like each circle to have its own
computational thread, but still be able to deal with the collision
physics when they run into each other. We'll assume it's a teaching
exercise, so "tell the GPU to do it" *isn't* the right answer
(although it might be an interesting entrant in a zoo of solutions).
Key performance metric: frames per second.
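One way to pin down what's shared in that archetype: a naive
single-threaded version of the state and physics involved (all names
and constants below are illustrative, not part of any proposal). The
parallelisation exercise is then to give each circle its own thread of
control while keeping these reads and writes of shared state safe:

```python
# Illustrative single-process baseline for the bouncing-circles problem:
# circles with position and velocity, a per-frame update, and pairwise
# collision checks over shared state.
import itertools
import random

class Circle:
    def __init__(self, x, y, vx, vy, r=1.0):
        self.x, self.y, self.vx, self.vy, self.r = x, y, vx, vy, r

    def step(self, dt, width, height):
        # Advance the circle and bounce it off the canvas walls.
        self.x += self.vx * dt
        self.y += self.vy * dt
        if not self.r <= self.x <= width - self.r:
            self.vx = -self.vx
        if not self.r <= self.y <= height - self.r:
            self.vy = -self.vy

def colliding(a, b):
    # Circles collide when the centre distance is under the radius sum.
    return (a.x - b.x) ** 2 + (a.y - b.y) ** 2 < (a.r + b.r) ** 2

def run_frames(circles, frames, dt=0.1, width=100.0, height=100.0):
    collisions = 0
    for _ in range(frames):
        for c in circles:
            c.step(dt, width, height)
        for a, b in itertools.combinations(circles, 2):
            if colliding(a, b):
                # Crude elastic-collision stand-in: swap velocities.
                a.vx, b.vx = b.vx, a.vx
                a.vy, b.vy = b.vy, a.vy
                collisions += 1
    return collisions

if __name__ == "__main__":
    rng = random.Random(42)
    circles = [Circle(rng.uniform(5, 95), rng.uniform(5, 95),
                      rng.uniform(-5, 5), rng.uniform(-5, 5))
               for _ in range(20)]
    print("collisions over 100 frames:", run_frames(circles, 100))
```

Every candidate solution (threads plus locks, subinterpreters plus
channels, actors, etc.) would be judged on how many circles it can
keep at a target frame rate.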

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia

