
On 25/06/15 00:10, Devin Jeanpierre wrote:
> So there are two reasons I can think of to use threads for CPU parallelism:
>
> - My thing does a lot of parallel work, and so I want to save on memory by sharing an address space.
>
>   This only becomes an especially pressing concern if you start running tens of thousands of workers or more. Fork also allows this.
This might not be a valid concern. Sharing an address space means sharing *virtual memory*. Presumably what they really want is to save *physical memory*, and two processes can map the same physical memory into their virtual address spaces.
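For example, here is a minimal Unix-only sketch: an mmap with fd -1 is an anonymous MAP_SHARED mapping, so after a fork both processes see the same physical pages, each through its own virtual address space.

    import mmap, os

    buf = mmap.mmap(-1, 4096)     # anonymous mapping, MAP_SHARED on Unix

    pid = os.fork()               # Unix-only
    if pid == 0:
        buf[0:5] = b"hello"       # the child writes into the shared pages
        os._exit(0)

    os.waitpid(pid, 0)
    print(buf[0:5])               # b'hello': both processes see the same
                                  # physical memory, possibly at different
                                  # virtual addresses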
> - My thing does a lot of communication, and so I want fast communication through a shared address space.
>
>   This can become a pressing concern immediately, and so is a more visible issue.
This is a valid argument, but it is mainly a concern for code that passes deeply nested Python objects around.
> On Unix, IPC can be free or cheap due to shared memory.
This is also the case on Windows. On Unix, IPC mechanisms such as pipes, FIFOs, and Unix domain sockets are very cheap as well. Pipes are likewise very cheap on Windows, as are TCP sockets on localhost, and Windows named pipes perform comparably to Unix domain sockets.
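As a rough illustration, here is a small Unix-only sketch that times pushing 10000 messages of 1 kB through an os.pipe; the absolute numbers will of course depend on the machine, but the per-message cost is tiny.

    import os, time

    N = 10000
    payload = b"x" * 1024                   # 1 kB per message
    r, w = os.pipe()

    pid = os.fork()                         # Unix-only
    if pid == 0:
        os.close(w)
        received = 0
        while received < N * len(payload):  # drain everything the parent sends
            received += len(os.read(r, 65536))
        os._exit(0)

    os.close(r)
    t0 = time.perf_counter()
    for _ in range(N):
        os.write(w, payload)
    os.close(w)
    os.waitpid(pid, 0)
    print("%d messages of 1 kB in %.3f s" % (N, time.perf_counter() - t0))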
> Same applies to strings and other non-compound datatypes. Compound datatypes are hard even for the subinterpreter case, just because the objects you're referring to are not likely to exist on the other end, so you need a real copy.
Yes. With a "share nothing" message-passing approach, one has to make deep copies of any mutable object. And even though a tuple itself is immutable, it can still contain mutable objects. It is really hard to get around the pickle overhead with subinterpreters, and since the pickle overhead is huge compared to the low-level IPC, there is very little to be saved this way.
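To put rough numbers on this, here is a small sketch comparing pickling and unpickling a nested structure against a plain byte copy of the same payload. The exact ratio varies with the data, but serialization typically dominates the raw copy by orders of magnitude:

    import pickle, time

    nested = [[i] * 100 for i in range(10000)]   # a deeply nested structure

    t0 = time.perf_counter()
    blob = pickle.dumps(nested, pickle.HIGHEST_PROTOCOL)
    t1 = time.perf_counter()
    restored = pickle.loads(blob)    # the receiver must rebuild every object
    t2 = time.perf_counter()

    print("pickle:   %.4f s (%d bytes)" % (t1 - t0, len(blob)))
    print("unpickle: %.4f s" % (t2 - t1))

    t3 = time.perf_counter()
    raw = bytearray(blob)            # a plain byte copy of the same payload,
    t4 = time.perf_counter()         # roughly what the IPC transfer costs
    print("memcpy:   %.6f s" % (t4 - t3))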
> - separate refcounts replaces the refcount with a pointer to a refcount, and changes incref/decref.
> - refcount freezing lets you walk all objects and set the reference count to a magic value. incref/decref check if the refcount is frozen before working.
>
> With freezing, unlike this approach to separate refcounts, anyone that touches the refcount manually will just dirty the page and unfreeze the refcount, rather than crashing the process.
>
> Both of them will decrease performance for non-forking python code,
Freezing has little impact on a modern CPU with branch prediction. With GCC we can also use __builtin_expect to make sure the optimal code is generated. This is a bit similar to using typed memoryviews and NumPy arrays in Cython with and without bounds checking: a pragma like @cython.boundscheck(False) has little benefit for performance because of the CPU's branch prediction. The CPU learns to expect the bounds check to pass, and only if it fails does it have to flush the pipeline. If the check passes, the pipeline need not be flushed, and performance-wise it is as if the test were never there. This has improved greatly over the last decade, particularly because processors have been optimized for running languages like Java and .NET efficiently. A check for a thawed refcount would be similarly cheap.

Keeping reference counts in extra pages could impair performance, but mostly if multiple threads are allowed to access the same page. Because of hierarchical memory, the extra pointer lookup should not matter much. Modern CPUs have also evolved to solve the aliasing problem that formerly made Fortran code run faster than similar C code; today C code tends to be faster than similar Fortran. This helps if we keep refcounts in separate pages and the compiler cannot know what a pointer actually refers to or what it might alias. Ten or fifteen years ago it would have been a performance killer, but not today.
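To make the cost concrete, here is a rough sketch of the problem both proposals attack: after a fork, merely *reading* Python objects updates their refcounts, which dirties the copy-on-write pages they live on and forces the kernel to copy them. (Linux-only, since it reads Private_Dirty from /proc/self/smaps; the numbers will vary by build.)

    import os

    def private_dirty_kb():
        # Sum Private_Dirty over all mappings (Linux-specific).
        total = 0
        with open("/proc/self/smaps") as f:
            for line in f:
                if line.startswith("Private_Dirty:"):
                    total += int(line.split()[1])
        return total

    data = [str(i) for i in range(1000000)]   # lots of small heap objects

    pid = os.fork()
    if pid == 0:
        before = private_dirty_kb()
        for item in data:      # a read-only traversal, but incref/decref
            pass               # write to every object and dirty its page
        after = private_dirty_kb()
        print("child dirtied %d kB without mutating anything"
              % (after - before))
        os._exit(0)
    os.waitpid(pid, 0)

Sturla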