
On 25/06/15 00:10, Devin Jeanpierre wrote:
> So there are two reasons I can think of to use threads for CPU parallelism:
>
> - My thing does a lot of parallel work, and so I want to save on memory by sharing an address space.
>
>   This only becomes an especially pressing concern if you start running tens of thousands of workers or more. Fork also allows this.
This might not be a valid concern. Sharing an address space means sharing *virtual memory*. Presumably what they really want is to save *physical memory*, and two processes can map the same physical memory into their virtual address spaces.
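For example, here is a minimal Unix-only sketch: an mmap with fd -1 is an anonymous MAP_SHARED mapping, so after a fork both processes see the same physical pages, each through its own virtual address space.

    import mmap, os

    buf = mmap.mmap(-1, 4096)     # anonymous mapping, MAP_SHARED on Unix

    pid = os.fork()               # Unix-only
    if pid == 0:
        buf[0:5] = b"hello"       # the child writes into the shared pages
        os._exit(0)

    os.waitpid(pid, 0)
    print(buf[0:5])               # b'hello': both processes see the same
                                  # physical memory, possibly at different
                                  # virtual addresses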
> - My thing does a lot of communication, and so I want fast communication through a shared address space.
>
>   This can become a pressing concern immediately, and so is a more visible issue.
This is a valid argument, but it is mainly a concern for code that passes deeply nested Python objects around.
> On Unix, IPC can be free or cheap due to shared memory.
This is also the case on Windows. On Unix, IPC mechanisms such as pipes, FIFOs, and Unix domain sockets are very cheap as well. Pipes are likewise very cheap on Windows, as are TCP sockets on localhost, and Windows named pipes perform comparably to Unix domain sockets.
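As a rough illustration, here is a small Unix-only sketch that times pushing 10000 messages of 1 kB through an os.pipe; the absolute numbers will of course depend on the machine, but the per-message cost is tiny.

    import os, time

    N = 10000
    payload = b"x" * 1024                   # 1 kB per message
    r, w = os.pipe()

    pid = os.fork()                         # Unix-only
    if pid == 0:
        os.close(w)
        received = 0
        while received < N * len(payload):  # drain everything the parent sends
            received += len(os.read(r, 65536))
        os._exit(0)

    os.close(r)
    t0 = time.perf_counter()
    for _ in range(N):
        os.write(w, payload)
    os.close(w)
    os.waitpid(pid, 0)
    print("%d messages of 1 kB in %.3f s" % (N, time.perf_counter() - t0))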
> Same applies to strings and other non-compound datatypes. Compound datatypes are hard even for the subinterpreter case, just because the objects you're referring to are not likely to exist on the other end, so you need a real copy.
Yes. With a "share nothing" message-passing approach, one has to make deep copies of any mutable object. And even though a tuple itself is immutable, it can still contain mutable objects. It is really hard to get around the pickle overhead with subinterpreters, and since the pickle overhead is huge compared to the low-level IPC, there is very little to be saved this way.
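To put rough numbers on this, here is a small sketch comparing pickling and unpickling a nested structure against a plain byte copy of the same payload. The exact ratio varies with the data, but serialization typically dominates the raw copy by orders of magnitude:

    import pickle, time

    nested = [[i] * 100 for i in range(10000)]   # a deeply nested structure

    t0 = time.perf_counter()
    blob = pickle.dumps(nested, pickle.HIGHEST_PROTOCOL)
    t1 = time.perf_counter()
    restored = pickle.loads(blob)    # the receiver must rebuild every object
    t2 = time.perf_counter()

    print("pickle:   %.4f s (%d bytes)" % (t1 - t0, len(blob)))
    print("unpickle: %.4f s" % (t2 - t1))

    t3 = time.perf_counter()
    raw = bytearray(blob)            # a plain byte copy of the same payload,
    t4 = time.perf_counter()         # roughly what the IPC transfer costs
    print("memcpy:   %.6f s" % (t4 - t3))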
> - separate refcounts replaces the refcount with a pointer to a refcount, and changes incref/decref.
> - refcount freezing lets you walk all objects and set the reference count to a magic value. incref/decref check if the refcount is frozen before working.
>
> With freezing, unlike this approach to separate refcounts, anyone that touches the refcount manually will just dirty the page and unfreeze the refcount, rather than crashing the process.
>
> Both of them will decrease performance for non-forking python code,
Freezing has little impact on a modern CPU with branch prediction. With GCC we can also use __builtin_expect to make sure the optimal code is generated. This is a bit similar to using typed memoryviews and NumPy arrays in Cython with and without bounds checking: a pragma like @cython.boundscheck(False) has little benefit for performance because of the CPU's branch prediction. The CPU learns to expect the bounds check to pass, and only if it fails does it have to flush the pipeline. If the check passes, the pipeline need not be flushed, and performance-wise it is as if the test were never there. This has improved greatly over the last decade, particularly because processors have been optimized for running languages like Java and .NET efficiently. A check for a thawed refcount would be similarly cheap.

Keeping reference counts in extra pages could impair performance, but mostly if multiple threads are allowed to access the same page. Because of hierarchical memory, the extra pointer lookup should not matter much. Modern CPUs have also evolved to solve the aliasing problem that formerly made Fortran code run faster than similar C code; today C code tends to be faster than similar Fortran. This helps if we keep refcounts in separate pages and the compiler cannot know what a pointer actually refers to or what it might alias. Ten or fifteen years ago it would have been a performance killer, but not today.
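To make the cost concrete, here is a rough sketch of the problem both proposals attack: after a fork, merely *reading* Python objects updates their refcounts, which dirties the copy-on-write pages they live on and forces the kernel to copy them. (Linux-only, since it reads Private_Dirty from /proc/self/smaps; the numbers will vary by build.)

    import os

    def private_dirty_kb():
        # Sum Private_Dirty over all mappings (Linux-specific).
        total = 0
        with open("/proc/self/smaps") as f:
            for line in f:
                if line.startswith("Private_Dirty:"):
                    total += int(line.split()[1])
        return total

    data = [str(i) for i in range(1000000)]   # lots of small heap objects

    pid = os.fork()
    if pid == 0:
        before = private_dirty_kb()
        for item in data:      # a read-only traversal, but incref/decref
            pass               # write to every object and dirty its page
        after = private_dirty_kb()
        print("child dirtied %d kB without mutating anything"
              % (after - before))
        os._exit(0)
    os.waitpid(pid, 0)

Sturla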