[Python-ideas] solving multi-core Python
Sturla Molden
sturla.molden at gmail.com
Thu Jun 25 01:30:21 CEST 2015
On 25/06/15 00:10, Devin Jeanpierre wrote:
> So there's two reasons I can think of to use threads for CPU parallelism:
>
> - My thing does a lot of parallel work, and so I want to save on
> memory by sharing an address space
>
> This only becomes an especially pressing concern if you start running
> tens of thousands or more of workers. Fork also allows this.
This might not be a valid concern. Sharing an address space means sharing
*virtual memory*. Presumably what they really want is to save *physical
memory*, and two processes can map the same physical memory into their
virtual address spaces.
> - My thing does a lot of communication, and so I want fast
> communication through a shared address space
>
> This can become a pressing concern immediately, and so is a more
> visible issue.
This is a valid argument. It is mainly a concern for those who use
deeply nested Python objects though.
> On Unix, IPC can be free or cheap due to shared memory.
This is also the case on Windows.
IPC mechanisms like pipes, FIFOs and Unix domain sockets are also very
cheap on Unix.
Pipes are also very cheap on Windows, as are TCP sockets on localhost.
Windows named pipes are similar to Unix domain sockets in performance.
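For concreteness, a minimal sketch of the cheap Unix case: raw bytes over an anonymous pipe between a forked parent and child (Unix-only, since it uses os.fork).

```python
# Sketch: raw-byte IPC over a pipe between forked processes (Unix).
import os

r, w = os.pipe()
pid = os.fork()
if pid == 0:
    # Child: write a small message and exit without cleanup handlers.
    os.close(r)
    os.write(w, b"ping")
    os._exit(0)

# Parent: read exactly what the child sent.
os.close(w)
msg = os.read(r, 4)
os.close(r)
os.waitpid(pid, 0)
print(msg)
```

The transfer itself is just a copy of four bytes through the kernel; there is no serialization cost until Python objects enter the picture.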
> Same applies to strings and other non-compound datatypes. Compound
> datatypes are hard even for the subinterpreter case, just because the
> objects you're referring to are not likely to exist on the other end,
> so you need a real copy.
Yes.
With a "share nothing" message-passing approach, one will have to make
deep copies of any mutable object. And even though a tuple itself is
immutable, it can still contain mutable objects. It is really hard to
get around the pickle overhead with subinterpreters. Since the pickle
overhead is huge compared to the cost of the low-level IPC, there is
very little to save in this manner.
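A small sketch of why the copy is unavoidable: pickling a nested structure serializes every reachable object, and the receiver rebuilds a full deep copy, including mutable objects hidden inside immutable containers.

```python
# Sketch: message passing via pickle forces a deep copy of the whole graph.
import pickle

nested = {"a": [1, 2, {"b": (3, [4, 5])}]}   # mutable list inside a tuple

payload = pickle.dumps(nested)   # serializes every reachable object
copy = pickle.loads(payload)     # "receiver" rebuilds a full deep copy

copy["a"][2]["b"][1].append(6)   # mutate the list inside the copied tuple
print(nested["a"][2]["b"][1])    # original is untouched: a true deep copy
```

The dumps/loads pair touches every object in the graph, which is why this cost dominates the few microseconds the underlying pipe or socket transfer takes.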
> - separate refcounts replaces refcount with a pointer to refcount, and
> changes incref/decref.
> - refcount freezing lets you walk all objects and set the reference
> count to a magic value. incref/decref check if the refcount is frozen
> before working.
>
> With freezing, unlike this approach to separate refcounts, anyone that
> touches the refcount manually will just dirty the page and unfreeze
> the refcount, rather than crashing the process.
>
> Both of them will decrease performance for non-forking python code,
Freezing has little impact on a modern CPU with branch prediction. On
GCC we can also use __builtin_expect to make sure the optimal code is
generated.
This is a bit similar to using typed memoryviews and NumPy arrays in
Cython with and without bounds checking. A directive like
@cython.boundscheck(False) has little performance benefit because of the
CPU's branch prediction. The CPU learns to expect the bounds check to
pass, and only if it fails must the pipeline be flushed. As long as the
check passes, the pipeline need not be flushed, and performance-wise it
is as if the test were never there. Branch prediction has improved
greatly over the last decade, partly because processors have been
optimized to run languages like Java and .NET efficiently. A check for a
thawed refcount would be similarly cheap.
Keeping reference counts in separate pages could impair performance, but
mostly if multiple threads are allowed to access the same page. Because
of the hierarchical memory system, the extra pointer lookup should not
matter much. Modern CPUs have also evolved to cope with the pointer
aliasing problem that formerly made Fortran code run faster than similar
C code; today C code tends to be faster than similar Fortran. This helps
if we keep refcounts in a separate page, where the compiler cannot know
what the pointer actually refers to or what it might alias. Ten or
fifteen years ago that would have been a performance killer, but not
today.
Sturla