
Thomas Grainger writes:
Would a work stealing approach work better for you here? Then the only signalling overhead would be when a core runs out of work
Not sure what you're talking about with "work stealing". It sounds conceptually more complex than the queue + worker pool approach, which is already implemented in both the threading and multiprocessing modules.

The overhead of creating hundreds of multiprocessing tasks is going to be barely human-perceptible. The other "overhead" is the programmer effort in assembling the finished product (assuming order matters, or there are interdependencies between chunks that require keeping per-chunk state). But I don't see how such programmer effort would be much greater for the "many chunks in a queue" approach vs. the chunk-per-core approach.

So it seems to me that multiprocessing with a worker pool is a low-programmer-effort, very-high-efficiency-gain approach to this problem.

The remaining question is "how many chunks?" If that's relevant, ISTM a few simple experiments will show where the sweet spot is. Try a queue of 64 chunks, then 128 chunks, and refine guesses from there.

I may be missing something, but that's the thinking that led to my previous post.

Steve
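For concreteness, here's a minimal sketch of the "many chunks in a queue + worker pool" approach using multiprocessing.Pool. The data, the chunk count, and the `process_chunk` function are all hypothetical stand-ins for whatever the actual workload is; the point is just that Pool.imap hands chunks to workers as they free up and returns results in submission order, so reassembly is trivial.

```python
from multiprocessing import Pool

def process_chunk(chunk):
    # Hypothetical CPU-bound work: sum of squares over the chunk.
    return sum(x * x for x in chunk)

def main():
    data = list(range(1_000_000))

    # "How many chunks?" -- experiment: try 64, then 128, refine from there.
    n_chunks = 64
    size = (len(data) + n_chunks - 1) // n_chunks
    chunks = [data[i:i + size] for i in range(0, len(data), size)]

    # Pool() defaults to one worker per core; imap yields results
    # in submission order, so no extra bookkeeping is needed to
    # reassemble the finished product.
    with Pool() as pool:
        results = list(pool.imap(process_chunk, chunks))
    return sum(results)

if __name__ == "__main__":
    print(main())
```

Because each chunk is an independent task pulled from the pool's internal queue, a worker that finishes early just grabs the next chunk, which is the same load-balancing benefit work stealing is after, without any extra machinery.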