[Python-ideas] solving multi-core Python
Sturla Molden
sturla.molden at gmail.com
Thu Jun 25 01:30:21 CEST 2015
On 25/06/15 00:10, Devin Jeanpierre wrote:
> So there's two reasons I can think of to use threads for CPU parallelism:
>
> - My thing does a lot of parallel work, and so I want to save on
> memory by sharing an address space
>
> This only becomes an especially pressing concern if you start running
> tens of thousands or more of workers. Fork also allows this.
This might not be a valid concern. Sharing an address space means sharing
*virtual memory*. Presumably what they really want is to save *physical
memory*, and two processes can map the same physical memory into their
virtual address spaces.
> - My thing does a lot of communication, and so I want fast
> communication through a shared address space
>
> This can become a pressing concern immediately, and so is a more
> visible issue.
This is a valid argument. It is mainly a concern for those who use
deeply nested Python objects though.
> On Unix, IPC can be free or cheap due to shared memory.
This is also the case on Windows.
IPC mechanisms like pipes, FIFOs and Unix domain sockets are also very
cheap on Unix.
Pipes are also very cheap on Windows, as are TCP sockets on localhost.
Windows named pipes are similar to Unix domain sockets in performance.
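For concreteness, a minimal sketch of the cheap Unix case: raw bytes over an anonymous pipe between a forked parent and child (Unix-only, since it uses os.fork).

```python
# Sketch: raw-byte IPC over a pipe between forked processes (Unix).
import os

r, w = os.pipe()
pid = os.fork()
if pid == 0:
    # Child: write a small message and exit without cleanup handlers.
    os.close(r)
    os.write(w, b"ping")
    os._exit(0)

# Parent: read exactly what the child sent.
os.close(w)
msg = os.read(r, 4)
os.close(r)
os.waitpid(pid, 0)
print(msg)
```

The transfer itself is just a copy of four bytes through the kernel; there is no serialization cost until Python objects enter the picture.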
> Same applies to strings and other non-compound datatypes. Compound
> datatypes are hard even for the subinterpreter case, just because the
> objects you're referring to are not likely to exist on the other end,
> so you need a real copy.
Yes.
With a "share nothing" message-passing approach, one will have to make
deep copies of any mutable object. And even though a tuple itself is
immutable, it can still contain mutable objects. It is really hard to
get around the pickle overhead with subinterpreters. Since the pickle
overhead is huge compared to the cost of the low-level IPC, there is
very little to save in this manner.
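A small sketch of why the copy is unavoidable: pickling a nested structure serializes every reachable object, and the receiver rebuilds a full deep copy, including mutable objects hidden inside immutable containers.

```python
# Sketch: message passing via pickle forces a deep copy of the whole graph.
import pickle

nested = {"a": [1, 2, {"b": (3, [4, 5])}]}   # mutable list inside a tuple

payload = pickle.dumps(nested)   # serializes every reachable object
copy = pickle.loads(payload)     # "receiver" rebuilds a full deep copy

copy["a"][2]["b"][1].append(6)   # mutate the list inside the copied tuple
print(nested["a"][2]["b"][1])    # original is untouched: a true deep copy
```

The dumps/loads pair touches every object in the graph, which is why this cost dominates the few microseconds the underlying pipe or socket transfer takes.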
> - separate refcounts replaces refcount with a pointer to refcount, and
> changes incref/decref.
> - refcount freezing lets you walk all objects and set the reference
> count to a magic value. incref/decref check if the refcount is frozen
> before working.
>
> With freezing, unlike this approach to separate refcounts, anyone that
> touches the refcount manually will just dirty the page and unfreeze
> the refcount, rather than crashing the process.
>
> Both of them will decrease performance for non-forking python code,
Freezing has little impact on a modern CPU with branch prediction. On
GCC we can also use __builtin_expect to make sure the optimal code is
generated.
This is a bit similar to using typed memoryviews and NumPy arrays in
Cython with and without bounds checking. A directive like
@cython.boundscheck(False) has little performance benefit because of the
CPU's branch prediction. The CPU learns to expect the bounds check to
pass, and only if it fails must the pipeline be flushed. As long as the
check passes, the pipeline need not be flushed, and performance-wise it
is as if the test were never there. Branch prediction has improved
greatly over the last decade, partly because processors have been
optimized to run languages like Java and .NET efficiently. A check for a
thawed refcount would be similarly cheap.
Keeping reference counts in separate pages could impair performance, but
mostly if multiple threads are allowed to access the same page. Because
of the hierarchical memory system, the extra pointer lookup should not
matter much. Modern CPUs have also evolved to cope with the pointer
aliasing problem that formerly made Fortran code run faster than similar
C code; today C code tends to be faster than similar Fortran. This helps
if we keep refcounts in a separate page, where the compiler cannot know
what the pointer actually refers to or what it might alias. Ten or
fifteen years ago that would have been a performance killer, but not
today.
Sturla